Posted to user@tika.apache.org by raghu vittal <rr...@live.com> on 2016/02/19 10:37:49 UTC

Unable to extract content from chunked portion of large file

Hi All

We have very large PDF, .docx, and .xlsx files. We are using Tika to extract content and dump the data into Elasticsearch for full-text search.
Sending very large files to Tika causes an out-of-memory exception.

We want to chunk the file and send it to Tika for content extraction. When we passed a chunked portion of a file to Tika it gave back empty text.
I assume Tika relies on the file structure; that is why it is not giving any content.

We are using Tika Server (REST API) from our .NET application.

Please suggest a better approach for this scenario.

Regards,
Raghu.


RE: Unable to extract content from chunked portion of large file

Posted by Ken Krugler <kk...@transpac.com>.
Hi Sergey,

Thanks for digging into the code - I'd seen the docs and assumed it wouldn't work.

Anybody have a chance to give that a try? Maybe Raghu? :)

-- Ken

> From: Sergey Beryozkin
> Sent: February 24, 2016 7:44:13am PST
> To: user@tika.apache.org
> Subject: Re: Unable to extract content from chunked portion of large file
> 
> Hi All
> 
> If a large file is passed to a Tika server as a multipart/form-data payload,
> 
> then CXF will create a temp file on disk itself.
> 
> Hmm... I was looking for a reference to that and instead found the advice not to
> use multipart/form-data:
> https://wiki.apache.org/tika/TikaJAXRS (in Services)
> 
> I believe that advice should be removed, see:
> 
> http://svn.apache.org/repos/asf/tika/trunk/tika-server/src/main/java/org/apache/tika/server/resource/TikaResource.java, example:
> 
> @POST
> @Consumes("multipart/form-data")
> @Produces("text/plain")
> @Path("form")
> public StreamingOutput getTextFromMultipart(Attachment att, @Context final UriInfo info) {
>     return produceText(att.getObject(InputStream.class), att.getHeaders(), info);
> }
> 
> 
> Cheers, Sergey
> 
> 
> 
> On 24/02/16 15:37, Ken Krugler wrote:
>> Hi Raghu,
>> 
>> I don't think you understood what I was proposing.
>> 
>> I suggested creating a service that could receive chunks of the file
>> (persisted to local disk). Then this service could implement an input
>> stream class that would read sequentially from these pieces. This input
>> stream would be passed to Tika, thus giving Tika a single continuous
>> stream of data to the entire file content.
>> 
>> -- Ken
>> 
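
A minimal sketch of the approach Ken describes, in C# since the poster's client code is .NET. Everything here - the ChunkedFileStream name and the "part.NNN" chunk-file layout - is hypothetical rather than code from this thread; a Java service would express the same idea as a custom InputStream handed straight to Tika:

using System;
using System.Collections.Generic;
using System.IO;

// Hypothetical sketch: presents chunk files already persisted to local disk
// (e.g. "part.000", "part.001", ...) as one continuous read-only stream, so
// the whole document never has to sit in memory at once.
public class ChunkedFileStream : Stream
{
    private readonly Queue<string> _chunkPaths;
    private FileStream _current;

    public ChunkedFileStream(IEnumerable<string> chunkPaths)
    {
        _chunkPaths = new Queue<string>(chunkPaths);
    }

    public override int Read(byte[] buffer, int offset, int count)
    {
        while (true)
        {
            if (_current == null)
            {
                if (_chunkPaths.Count == 0) return 0;    // no chunks left: end of stream
                _current = File.OpenRead(_chunkPaths.Dequeue());
            }
            int n = _current.Read(buffer, offset, count);
            if (n > 0) return n;                         // data read from the current chunk
            _current.Dispose();                          // chunk exhausted, move to the next
            _current = null;
        }
    }

    protected override void Dispose(bool disposing)
    {
        if (disposing) _current?.Dispose();
        base.Dispose(disposing);
    }

    public override bool CanRead => true;
    public override bool CanSeek => false;
    public override bool CanWrite => false;
    public override long Length => throw new NotSupportedException();
    public override long Position
    {
        get => throw new NotSupportedException();
        set => throw new NotSupportedException();
    }
    public override void Flush() { }
    public override long Seek(long offset, SeekOrigin origin) => throw new NotSupportedException();
    public override void SetLength(long value) => throw new NotSupportedException();
    public override void Write(byte[] buffer, int offset, int count) => throw new NotSupportedException();
}

Wrapped in a StreamContent, a stream like this could be sent to a server without the full file ever being materialized in memory.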
>>> ------------------------------------------------------------------------
>>> 
>>> *From:* raghu vittal
>>> 
>>> *Sent:* February 24, 2016 4:32:01am PST
>>> 
>>> *To:* user@tika.apache.org <ma...@tika.apache.org>
>>> 
>>> *Subject:* Re: Unable to extract content from chunked portion of large
>>> file
>>> 
>>> 
>>> Thanks for your reply.
>>> 
>>> In our application users can upload large files. Our intention is to
>>> extract the content out of these large files and dump it in Elastic for
>>> content-based search.
>>> We have .xlsx and .doc files > 300 MB in size. Sending such a large file to
>>> Tika causes timeout issues.
>>> 
>>> I tried getting a chunk of the file and passing it to Tika. Tika gave me an
>>> invalid data exception.
>>> 
>>> I think for Tika we need to pass the entire file at once to extract content.
>>> 
>>> Raghu.
>>> 
>>> ------------------------------------------------------------------------
>>> *From:*Ken Krugler <kkrugler_lists@transpac.com
>>> <ma...@transpac.com>>
>>> *Sent:*Friday, February 19, 2016 8:22 PM
>>> *To:*user@tika.apache.org <ma...@tika.apache.org>
>>> *Subject:*RE: Unable to extract content from chunked portion of large file
>>> One option is to create your own RESTful API that lets you send chunks
>>> of the file, and then you can provide an input stream that provides
>>> the seamless data view of the chunks to Tika (which is what it needs).
>>> 
>>> -- Ken
>>> 





--------------------------
Ken Krugler
+1 530-210-6378
http://www.scaleunlimited.com
custom big data solutions & training
Hadoop, Cascading, Cassandra & Solr






Re: Unable to extract content from chunked portion of large file

Posted by "Mattmann, Chris A (3980)" <ch...@jpl.nasa.gov>.
yayyy!

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Chief Architect
Instrument Software and Science Data Systems Section (398)
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 168-519, Mailstop: 168-527
Email: chris.a.mattmann@nasa.gov
WWW:  http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++







Re: Unable to extract content from chunked portion of large file

Posted by Sergey Beryozkin <sb...@gmail.com>.
Time to start contributing to Tika again :-)

Cheers, Sergey


Re: Unable to extract content from chunked portion of large file

Posted by "Mattmann, Chris A (3980)" <ch...@jpl.nasa.gov>.
thanks mucho my friend

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Chief Architect
Instrument Software and Science Data Systems Section (398)
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 168-519, Mailstop: 168-527
Email: chris.a.mattmann@nasa.gov
WWW:  http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++







Re: Unable to extract content from chunked portion of large file

Posted by Sergey Beryozkin <sb...@gmail.com>.
Hi Chris

Sure, I've opened
https://issues.apache.org/jira/browse/TIKA-1871

and assigned it to myself; will add some info about multipart/form-data asap

Cheers, Sergey




Re: Tika Wiki Login

Posted by Sergey Beryozkin <sb...@gmail.com>.
Hi Chris, thanks, I can't yet reply that the issue is done :-) but I will 
take care of it.

Thanks, Sergey


Re: Tika Wiki Login

Posted by "Mattmann, Chris A (3980)" <ch...@jpl.nasa.gov>.
permission granted :)

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Chief Architect
Instrument Software and Science Data Systems Section (398)
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 168-519, Mailstop: 168-527
Email: chris.a.mattmann@nasa.gov
WWW:  http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++







Tika Wiki Login

Posted by Sergey Beryozkin <sb...@gmail.com>.
Hi Chris

Can you please give me the rights to edit the wiki? I have all the docs 
signed. I can edit the CXF and Camel wikis with a 'sergey_beryozkin' login, 
and thought I could do the same with Tika.

Thanks, Sergey



Re: Unable to extract content from chunked portion of large file

Posted by "Mattmann, Chris A (3980)" <ch...@jpl.nasa.gov>.
+1, please just remove it from the wiki, since the server clearly supports
multipart per your research. Thanks, Sergey!

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Chief Architect
Instrument Software and Science Data Systems Section (398)
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 168-519, Mailstop: 168-527
Email: chris.a.mattmann@nasa.gov
WWW:  http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++







Re: Unable to extract content from chunked portion of large file

Posted by Sergey Beryozkin <sb...@gmail.com>.
Hi All

If a large file is passed to a Tika server as a multipart/form-data payload,

then CXF will create a temp file on disk itself.

Hmm... I was looking for a reference to that and instead found the advice not to
use multipart/form-data:
https://wiki.apache.org/tika/TikaJAXRS (in Services)

I believe that advice should be removed, see:

http://svn.apache.org/repos/asf/tika/trunk/tika-server/src/main/java/org/apache/tika/server/resource/TikaResource.java, 
example:

@POST
@Consumes("multipart/form-data")
@Produces("text/plain")
@Path("form")
public StreamingOutput getTextFromMultipart(Attachment att, @Context final UriInfo info) {
    return produceText(att.getObject(InputStream.class), att.getHeaders(), info);
}
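
For illustration, a minimal .NET client for this endpoint could look like the sketch below. It assumes a server on localhost:9998, and note it must be a POST, not a PUT, since the resource above is annotated @POST (a PUT to this path gets HTTP 415); the "file" field name and the TikaFormClient name are illustrative, not part of this thread:

using System;
using System.IO;
using System.Net.Http;
using System.Net.Http.Headers;
using System.Threading.Tasks;

public static class TikaFormClient
{
    // Hypothetical helper: POSTs one file as multipart/form-data to /tika/form.
    // StreamContent streams straight from disk, so the file is not buffered
    // in client memory first.
    public static async Task<string> ExtractTextAsync(string path)
    {
        using (var client = new HttpClient())
        using (var form = new MultipartFormDataContent())
        using (var file = File.OpenRead(path))
        {
            var part = new StreamContent(file);
            part.Headers.ContentType = new MediaTypeHeaderValue("application/octet-stream");
            form.Add(part, "file", Path.GetFileName(path));
            var response = await client.PostAsync("http://localhost:9998/tika/form", form);
            response.EnsureSuccessStatusCode();
            return await response.Content.ReadAsStringAsync();
        }
    }
}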


Cheers, Sergey





-- 
Sergey Beryozkin

Talend Community Coders
http://coders.talend.com/

Re: Unable to extract content from chunked portion of large file

Posted by Sergey Beryozkin <sb...@gmail.com>.
What I meant was that while working with option 1 you see Tika 
reporting a Zip Bomb issue - this is different from the issue you were 
facing initially, which was OOM.

Thus, we can assume the 1st option of dealing with submitting a massive 
file (where you submit the whole file in one request) works, except that 
you see a Zip Bomb issue, which is about the Zip content being problematic. 
This is what I suggested you investigate in a different thread.
One thing I can suggest: with an HTTP client sending a massive payload, 
it makes sense to set the connect and receive timeouts on the client side 
to some large values; check the HttpClient docs on how to do it.
The stacktrace was saying something about the connection being aborted - 
low receive/connect timeouts might have caused it, and in turn that 
might have caused Tika to mistakenly report a Zip Bomb...
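
For the .NET System.Net.Http.HttpClient used elsewhere in this thread, that means raising the 100-second default Timeout before streaming a large payload; a minimal sketch, where the 30-minute value is an arbitrary illustration:

using System;
using System.Net.Http;

// Illustrative only: the default HttpClient.Timeout is 100 seconds, which a
// multi-hundred-MB upload can easily exceed, aborting the connection mid-send.
var client = new HttpClient
{
    Timeout = TimeSpan.FromMinutes(30)
};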

So - try the 1st option again, with the timeouts set on the client side; 
if you still see a Zip Bomb issue, then investigate it separately, and 
also continue looking at the other options suggested in this 
thread...

HTH, Sergey


On 29/02/16 14:41, raghu vittal wrote:
> Thanks for your reply.
>
> I actually started this thread to find a way to extract content out of a chunked portion of a file.
>
> Does Tika support extracting content from a file chunk?
>
> Regards,
> Raghu.
> ________________________________________
> From: Sergey Beryozkin <sb...@gmail.com>
> Sent: Monday, February 29, 2016 7:23 PM
> To: user@tika.apache.org
> Subject: Re: Unable to extract content from chunked portion of large file
>
> Well, it is a different issue now, the server is processing a 250MB
> payload and throws an error:
>
> org.apache.tika.exception.TikaException: Zip bomb detected!
>
> So maybe you need to start a new thread...
>
> Cheers, Sergey
> On 29/02/16 13:49, raghu vittal wrote:
>> It is working, thanks.
>>
>> I have tried sending a 250 MB file using multipart/form-data and it is giving an exception.
>>
>> ERROR:
>> Feb 29, 2016 7:07:27 PM org.apache.tika.server.resource.TikaResource logRequest
>> INFO: tika/form (autodetecting type)
>> Feb 29, 2016 7:09:02 PM org.apache.tika.server.resource.TikaResource parse
>> WARNING: tika/form: Text extraction failed
>> org.apache.tika.exception.TikaException: Zip bomb detected!
>>         at org.apache.tika.sax.SecureContentHandler.throwIfCauseOf(SecureContentHandler.java:192)
>>         ... 31 more
>> Feb 29, 2016 7:09:02 PM org.apache.cxf.jaxrs.utils.JAXRSUtils logMessageHandlerProblem
>> SEVERE: Problem with writing the data, class org.apache.tika.server.resource.TikaResource$4, ContentType: text/plain
>> Feb 29, 2016 7:09:02 PM org.apache.cxf.phase.PhaseInterceptorChain doDefaultLogging
>> WARNING: Interceptor for {http://resource.server.tika.apache.org/}MetadataResource has thrown exception, unwinding now
>> org.apache.cxf.interceptor.Fault: Could not send Message.
>>         ... 24 more
>> Caused by: java.io.IOException: An established connection was aborted by the software in your host machine
>>         ... 35 more
>>
>>
>> And I have tried to get a chunk of the file data and pass it to Tika using multipart/form-data; I am getting an exception.
>>
>> ERROR:
>> Feb 29, 2016 7:02:43 PM org.apache.cxf.jaxrs.utils.JAXRSUtils logMessageHandlerProblem
>> SEVERE: Problem with writing the data, class org.apache.tika.server.resource.TikaResource$4, ContentType: text/plain
>> Feb 29, 2016 7:04:30 PM org.apache.tika.server.resource.TikaResource logRequest
>> INFO: tika/form (autodetecting type)
>> Feb 29, 2016 7:04:30 PM org.apache.tika.server.resource.TikaResource parse
>> WARNING: tika/form: Text extraction failed
>> org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.microsoft.ooxml.OOXMLParser@41530372
>> Feb 29, 2016 7:04:30 PM org.apache.cxf.jaxrs.utils.JAXRSUtils logMessageHandlerProblem
>> SEVERE: Problem with writing the data, class org.apache.tika.server.resource.TikaResource$4, ContentType: text/plain
>>
>>
>> We are stuck handling these scenarios. In production we have documents of this size; we need to handle this.
>>
>> please help us.
>>
>> Regards,
>> Raghu.
>>
>> ________________________________________
>> From: Sergey Beryozkin <sb...@gmail.com>
>> Sent: Monday, February 29, 2016 6:50 PM
>> To: user@tika.apache.org
>> Subject: Re: Unable to extract content from chunked portion of large file
>>
>> Hi
>>
>> In the first case it should be
>>
>> http://localhost:9998/tika/form
>>
>> Sergey
>> On 29/02/16 13:09, raghu vittal wrote:
>>> Hi Ken,
>>>
>>> These are my observations.
>>>
>>> Scenario 1
>>>
>>> Tika URL: http://localhost:9998/tika
>>>
>>> I have tried the multipart/form-data approach suggested by Sergey. I am
>>> getting the below error (we are using the Tika 1.11 server).
>>>
>>> var data = File.ReadAllBytes(filename);
>>> using (var client = new HttpClient())
>>> {
>>>     using (var content = new MultipartFormDataContent())
>>>     {
>>>         ByteArrayContent byteArrayContent = new ByteArrayContent(data);
>>>         byteArrayContent.Headers.Add("Content-Type", "application/octet-stream");
>>>         content.Add(byteArrayContent);
>>>         var str = client.PutAsync(tikaServerUrl,
>>>             content).Result.Content.ReadAsStringAsync().Result;
>>>     }
>>> }
>>>
>>> ERROR:
>>>
>>> Feb 29, 2016 5:26:01 PM org.apache.tika.server.resource.TikaResource logRequest
>>> INFO: tika (multipart/form-data;boundary="03cc158f-3213-439f-a0be-3aba14c7036b")
>>> Feb 29, 2016 5:26:01 PM org.apache.tika.server.resource.TikaResource parse
>>> WARNING: tika: Text extraction failed
>>> org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.server.resource.TikaResource$1@36b1a1ec
>>>         at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:282)
>>>         at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
>>>         ................................................
>>>         at java.lang.Thread.run(Thread.java:745)
>>> Caused by: javax.ws.rs.WebApplicationException: HTTP 415 Unsupported Media Type
>>>         at org.apache.tika.server.resource.TikaResource$1.parse(TikaResource.java:116)
>>>         at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
>>>         ... 32 more
>>> Feb 29, 2016 5:26:01 PM org.apache.cxf.jaxrs.utils.JAXRSUtils logMessageHandlerProblem
>>> SEVERE: Problem with writing the data, class org.apache.tika.server.resource.TikaResource$4, ContentType: text/plain
>>>
>>> I think TIKA does not support POST request.
>>>
>>> Passing a 240 MB file to Tika for content extraction gives me the
>>> errors below.
>>>
>>> Scenario 2
>>>
>>>
>>> Tika URL: http://localhost:9998/unpack/all
>>>
>>> Rather than ReadAsStringAsync() I have used ReadAsStreamAsync() and
>>> captured the output stream into a "ZipArchive".
>>>
>>> ERROR:
>>>
>>> Feb 29, 2016 6:03:26 PM org.apache.cxf.jaxrs.utils.JAXRSUtils logMessageHandlerProblem
>>> SEVERE: Problem with writing the data, class java.util.HashMap, ContentType: application/zip
>>> Feb 29, 2016 6:03:26 PM org.apache.cxf.phase.PhaseInterceptorChain doDefaultLogging
>>> WARNING: Interceptor for {http://resource.server.tika.apache.org/}MetadataResource has thrown exception, unwinding now
>>> org.apache.cxf.interceptor.Fault
>>>         at org.apache.cxf.jaxrs.interceptor.JAXRSOutInterceptor.handleWriteException(JAXRSOutInterceptor.java:363)
>>>         ... 41 more
>>>
>>> Feb 29, 2016 6:03:28 PM org.apache.cxf.phase.PhaseInterceptorChain doDefaultLogging
>>> WARNING: Interceptor for {http://resource.server.tika.apache.org/}MetadataResource has thrown exception, unwinding now
>>> org.apache.cxf.interceptor.Fault: XML_WRITE_EXC
>>>         at org.apache.cxf.jaxrs.interceptor.JAXRSDefaultFaultOutInterceptor.handleMessage(JAXRSDefaultFaultOutInterceptor.java:102)
>>> Caused by: com.ctc.wstx.exc.WstxIOException: null
>>>         at com.ctc.wstx.sw.BaseStreamWriter.flush(BaseStreamWriter.java:255)
>>>         at org.apache.cxf.jaxrs.interceptor.JAXRSDefaultFaultOutInterceptor.handleMessage(JAXRSDefaultFaultOutInterceptor.java:100)
>>>         ... 26 more
>>>
>>> Scenario 3
>>>
>>> Tika URL: http://localhost:9998/tika
>>>
>>> ERROR:
>>> Feb 29, 2016 6:05:55 PM org.apache.tika.server.resource.TikaResource logRequest
>>> INFO: tika (autodetecting type)
>>> Feb 29, 2016 6:07:35 PM org.apache.tika.server.resource.TikaResource parse
>>> WARNING: tika: Text extraction failed
>>> org.apache.tika.exception.TikaException: Zip bomb detected!
>>>         at org.apache.tika.sax.SecureContentHandler.throwIfCauseOf(SecureContentHandler.java:192)
>>>         at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:123)
>>>         at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
>>>         ... 31 more
>>>
>>> Feb 29, 2016 6:07:35 PM org.apache.cxf.jaxrs.utils.JAXRSUtils logMessageHandlerProblem
>>> SEVERE: Problem with writing the data, class org.apache.tika.server.resource.TikaResource$4, ContentType: text/plain
>>> Feb 29, 2016 6:07:35 PM org.apache.cxf.phase.PhaseInterceptorChain doDefaultLogging
>>> WARNING: Interceptor for {http://resource.server.tika.apache.org/}MetadataResource has thrown exception, unwinding now
>>> org.apache.cxf.interceptor.Fault: Could not send Message.
>>>         at org.apache.cxf.interceptor.MessageSenderInterceptor$MessageSenderEndingInterceptor.handleMessage(MessageSenderInterceptor.java:64)
>>>         ... 31 more
>>>
>>> Feb 29, 2016 6:07:35 PM org.apache.cxf.phase.PhaseInterceptorChain doDefaultLogging
>>> WARNING: Interceptor for {http://resource.server.tika.apache.org/}MetadataResource has thrown exception, unwinding now
>>> org.apache.cxf.interceptor.Fault: XML_WRITE_EXC
>>>         at org.apache.cxf.jaxrs.interceptor.JAXRSDefaultFaultOutInterceptor.handleMessage(JAXRSDefaultFaultOutInterceptor.java:102)
>>>         at org.apache.cxf.phase.PhaseInterceptorChain.doIntercept(PhaseInterceptorChain.java:307)
>>> ....
>>>
>>>
>>> I was able to extract the content from an 80 MB document.
>>>
>>> If I split the large file into chunks and pass them to Tika, it gives me
>>> exceptions.
>>>
>>> I am building the solution in .NET.
>>>
>>> Regards,
>>> Raghu.
>>>
>>>
>>> ------------------------------------------------------------------------
>>> *From:* Ken Krugler <kk...@transpac.com>
>>> *Sent:* Saturday, February 27, 2016 6:22 AM
>>> *To:* user@tika.apache.org
>>> *Subject:* RE: Unable to extract content from chunked portion of large file
>>> Hi Raghu,
>>>
>>> Previously you'd said
>>>
>>> "sending very large files to Tika will cause out of memory exception"
>>>
>>> and
>>>
>>> "sending that large file to Tika will causing timeout issues"
>>>
>>> I assume these are two different issues, as the second one seems related
>>> to how you're connecting to the Tika server via HTTP, correct?
>>>
>>> For out of memory issues, I'd suggested creating an input stream that
>>> can read from a chunked file *stored on disk*, thus alleviating at least
>>> part of the memory usage constraint. If the problem is that the
>>> resulting extracted text is also too big for memory, and you need to
>>> send it as a single document to Elasticsearch, then that's a separate
>>> (non-Tika) issue.
>>>
>>> For the timeout when sending the file to the Tika server, Sergey has
>>> already mentioned that you should be able to send it
>>> as multipart/form-data. And that will construct a temp file on disk from
>>> the chunks, and (I assume) stream it to Tika, so that also would take
>>> care of the same memory issue on the input side.
>>>
>>> Given the above, it seems like you've got enough ideas to try to solve
>>> this issue, yes?
>>>
>>> Regards,
>>>
>>> -- Ken
>>>
>>>> ------------------------------------------------------------------------
>>>>
>>>> *From:* raghu vittal
>>>>
>>>> *Sent:* February 24, 2016 10:50:29pm PST
>>>>
>>>> *To:* user@tika.apache.org <ma...@tika.apache.org>
>>>>
>>>> *Subject:* Re: Unable to extract content from chunked portion of large
>>>> file
>>>>
>>>>
>>>> Hi Ken,
>>>>
>>>> Thanks for the reply.
>>>> I understood your point.
>>>>
>>>> What I have tried:
>>>>
>>>>> byte[] srcBytes = File.ReadAllBytes(filePath);
>>>>
>>>>> get a chunk of 1 MB out of srcBytes
>>>>
>>>>> when I pass this 1 MB chunk to Tika it gives me the error
>>>>
>>>>> as per the wiki, Tika needs the entire file to extract content
>>>>
>>>> This is where I am stuck. I don't want to pass the entire file to Tika.
>>>>
>>>> Correct me if I am wrong.
>>>>
>>>> --Raghu.
>>>>
>>>
>>> --------------------------
>>> Ken Krugler
>>> +1 530-210-6378
>>> http://www.scaleunlimited.com
>>> custom big data solutions & training
>>> Hadoop, Cascading, Cassandra & Solr
>>>
>>>
>>>
>>>
>>>
>>
>>
>> --
>> Sergey Beryozkin
>>
>> Talend Community Coders
>> http://coders.talend.com/
>>
>
>
> --
> Sergey Beryozkin
>
> Talend Community Coders
> http://coders.talend.com/
>


Re: Unable to extract content from chunked portion of large file

Posted by raghu vittal <rr...@live.com>.
Thanks for your reply.

I actually started this thread to find a way to extract content from a chunked portion of a file.

Does Tika support extracting content from a file chunk?

Regards,
Raghu.
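
For reference, the answer suggested earlier in the thread is that Tika needs the whole document, so the chunks have to be reassembled before Tika ever sees them. A minimal .NET sketch of that reassembly, assuming the uploaded chunks are persisted to disk with names that sort into upload order (the directory layout and "chunk-*" naming are hypothetical):

using System.IO;
using System.Linq;

static class ChunkReassembler
{
    // Concatenate chunk files (chunk-000, chunk-001, ...) into one temp file.
    // Only small copy buffers are held in memory, never the whole document.
    public static string Reassemble(string chunkDir)
    {
        string merged = Path.GetTempFileName();
        using (var output = File.OpenWrite(merged))
        {
            foreach (var chunk in Directory.GetFiles(chunkDir, "chunk-*").OrderBy(f => f))
            {
                using (var input = File.OpenRead(chunk))
                {
                    input.CopyTo(output);
                }
            }
        }
        return merged; // stream this single file to the Tika server
    }
}

The merged file can then be streamed to the server in one request, which gives the parsers the continuous view of the file structure they rely on.
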
________________________________________
From: Sergey Beryozkin <sb...@gmail.com>
Sent: Monday, February 29, 2016 7:23 PM
To: user@tika.apache.org
Subject: Re: Unable to extract content from chunked portion of large file

Well, it is a different issue now, the server is processing a 250MB
payload and throws an error:

org.apache.tika.exception.TikaException: Zip bomb detected!

So maybe you need to start a new thread...

Cheers, Sergey


--
Sergey Beryozkin

Talend Community Coders
http://coders.talend.com/

Re: Unable to extract content from chunked portion of large file

Posted by Sergey Beryozkin <sb...@gmail.com>.
Well, it is a different issue now, the server is processing a 250MB 
payload and throws an error:

org.apache.tika.exception.TikaException: Zip bomb detected!

So maybe you need to start a new thread...

Cheers, Sergey


-- 
Sergey Beryozkin

Talend Community Coders
http://coders.talend.com/

Re: Unable to extract content from chunked portion of large file

Posted by raghu vittal <rr...@live.com>.
It is working, thanks.

I have tried sending a 250 MB file using multipart/form-data and it throws an exception.

ERROR:
Feb 29, 2016 7:07:27 PM org.apache.tika.server.resource.TikaResource logRequest
INFO: tika/form (autodetecting type)
Feb 29, 2016 7:09:02 PM org.apache.tika.server.resource.TikaResource parse
WARNING: tika/form: Text extraction failed
org.apache.tika.exception.TikaException: Zip bomb detected!
        at org.apache.tika.sax.SecureContentHandler.throwIfCauseOf(SecureContentHandler.java:192)
        ... 31 more
Feb 29, 2016 7:09:02 PM org.apache.cxf.jaxrs.utils.JAXRSUtils logMessageHandlerProblem
SEVERE: Problem with writing the data, class org.apache.tika.server.resource.TikaResource$4, ContentType: text/plain
Feb 29, 2016 7:09:02 PM org.apache.cxf.phase.PhaseInterceptorChain doDefaultLogging
WARNING: Interceptor for {http://resource.server.tika.apache.org/}MetadataResource has thrown exception, unwinding now
org.apache.cxf.interceptor.Fault: Could not send Message.
        ... 24 more
Caused by: java.io.IOException: An established connection was aborted by the software in your host machine
        ... 35 more


I have also tried taking a chunk of the file data and passing it to Tika using multipart/form-data; that also throws an exception.

ERROR:
Feb 29, 2016 7:02:43 PM org.apache.cxf.jaxrs.utils.JAXRSUtils logMessageHandlerProblem
SEVERE: Problem with writing the data, class org.apache.tika.server.resource.TikaResource$4, ContentType: text/plain
Feb 29, 2016 7:04:30 PM org.apache.tika.server.resource.TikaResource logRequest
INFO: tika/form (autodetecting type)
Feb 29, 2016 7:04:30 PM org.apache.tika.server.resource.TikaResource parse
WARNING: tika/form: Text extraction failed
org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.microsoft.ooxml.OOXMLParser@41530372
Feb 29, 2016 7:04:30 PM org.apache.cxf.jaxrs.utils.JAXRSUtils logMessageHandlerProblem
SEVERE: Problem with writing the data, class org.apache.tika.server.resource.TikaResource$4, ContentType: text/plain


We are stuck handling these scenarios. In production we have documents of this size, so we need to handle them.

Please help us.

Regards,
Raghu.
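
For the client-side memory part of this, a sketch of streaming the file from disk with StreamContent instead of File.ReadAllBytes, so the whole 250 MB is never buffered in the .NET process (the endpoint matches the thread's earlier /tika call; the timeout value is an assumption to tune for your environment):

using System;
using System.IO;
using System.Net.Http;

using (var client = new HttpClient())
{
    client.Timeout = TimeSpan.FromMinutes(10); // assumption: generous timeout for large files
    using (var fileStream = File.OpenRead(filePath)) // filePath: hypothetical local path
    using (var fileContent = new StreamContent(fileStream))
    {
        fileContent.Headers.Add("Content-Type", "application/octet-stream");
        // PUT the raw bytes to /tika; the body is read from disk as it is sent.
        var response = client.PutAsync("http://localhost:9998/tika", fileContent).Result;
        string text = response.Content.ReadAsStringAsync().Result;
    }
}
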

________________________________________
From: Sergey Beryozkin <sb...@gmail.com>
Sent: Monday, February 29, 2016 6:50 PM
To: user@tika.apache.org
Subject: Re: Unable to extract content from chunked portion of large file

Hi

In the first case it should be

http://localhost:9998/tika/form

Sergey


--
Sergey Beryozkin

Talend Community Coders
http://coders.talend.com/

Re: Unable to extract content from chunked portion of large file

Posted by Sergey Beryozkin <sb...@gmail.com>.
Hi

In the first case it should be

http://localhost:9998/tika/form

Sergey
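
A sketch of the corrected call from .NET, following this advice: multipart/form-data goes to /tika/form with POST rather than PUT (the part name "file" is an assumption; the server reads the first attachment either way):

using (var client = new HttpClient())
using (var content = new MultipartFormDataContent())
using (var fileStream = File.OpenRead(filename))
{
    var filePart = new StreamContent(fileStream); // stream from disk, no byte[] buffer
    filePart.Headers.Add("Content-Type", "application/octet-stream");
    content.Add(filePart, "file", Path.GetFileName(filename)); // "file" name is an assumption
    // The form endpoint expects POST, so use PostAsync, not PutAsync.
    var response = client.PostAsync("http://localhost:9998/tika/form", content).Result;
    string text = response.Content.ReadAsStringAsync().Result;
}
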
On 29/02/16 13:09, raghu vittal wrote:
> Hi ken,
>
>
> these are my observations ..
>
>
> scenario -1
>
>
> Tika Url : http://localhost:9998/tika
>
>
> I have tried the multipart/form-data  suggested by Sergey . i am getting
> below error (we are using tika 1.11 server)
>
> var data = File.ReadAllBytes(filename);
> using (var client = new HttpClient())
> {
> using (var content = new MultipartFormDataContent())
> {
> ByteArrayContent byteArrayContent = new ByteArrayContent(data);
> byteArrayContent.Headers.Add("Content-Type", "application/octet-stream");
> content.Add(byteArrayContent);
> var str = client.PutAsync(tikaServerUrl,
> content).Result.Content.ReadAsStringAsync().Result;
> }
>
> *ERROR*:
>
> Feb 29, 2016 5:26:01 PM org.apache.tika.server.resource.TikaResource
> logRequest
> INFO: tika
> (multipart/form-data;boundary="03cc158f-3213-439f-a0be-3aba14c7036b")
>
> Feb 29, 2016 5:26:01 PM org.apache.tika.server.resource.TikaResource parse
> WARNING: tika: Text extraction failed
> org.apache.tika.exception.TikaException: Unexpected RuntimeException
> from org.ap
> ache.tika.server.resource.TikaResource$1@36b1a1ec
>          at
> org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:282
> )
>          at
> org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:1
> 20)
>      ................................................
>          at java.lang.Thread.run(Thread.java:745)
> Caused by: javax.ws.rs.WebApplicationException: HTTP 415 Unsupported
> Media Type
>          at
> org.apache.tika.server.resource.TikaResource$1.parse(TikaResource.jav
> a:116)
>          at
> org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280
> )
>          ... 32 more
>
> Feb 29, 2016 5:26:01 PM org.apache.cxf.jaxrs.utils.JAXRSUtils
> logMessageHandlerP
> roblem
> SEVERE: Problem with writing the data, class
> org.apache.tika.server.resource.Tik
> aResource$4, ContentType: text/plain
>
> I think TIKA does not support POST request.
>
> *
> *
>
> *Passing 240 MB file to tika for content extraction it is giving me the
> Errors.*
>
> *Scenario -2*
>
>
> Tika Url : http://localhost:9998/unpack/all
>
>
> Rather than ReadStringAsync() i have used ReadStreamAsync()  and
> captured the output stream to "ZipArchive"
>
>
> *ERROR:*
>
> Feb 29, 2016 6:03:26 PM org.apache.cxf.jaxrs.utils.JAXRSUtils
> logMessageHandlerP
> roblem
> SEVERE: Problem with writing the data, class java.util.HashMap,
> ContentType: app
> lication/zip
> Feb 29, 2016 6:03:26 PM org.apache.cxf.phase.PhaseInterceptorChain
> doDefaultLogg
> ing
> WARNING: Interceptor for
> {http://resource.server.tika.apache.org/}MetadataResour
> ce has thrown exception, unwinding now
> org.apache.cxf.interceptor.Fault
>          at
> org.apache.cxf.jaxrs.interceptor.JAXRSOutInterceptor.handleWriteExcep
> tion(JAXRSOutInterceptor.java:363)
>          ... 41 more
>
> Feb 29, 2016 6:03:28 PM org.apache.cxf.phase.PhaseInterceptorChain
> doDefaultLogg
> ing
> WARNING: Interceptor for
> {http://resource.server.tika.apache.org/}MetadataResour
> ce has thrown exception, unwinding now
> org.apache.cxf.interceptor.Fault: XML_WRITE_EXC
>          at
> org.apache.cxf.jaxrs.interceptor.JAXRSDefaultFaultOutInterceptor.hand
> leMessage(JAXRSDefaultFaultOutInterceptor.java:102)
> Caused by: com.ctc.wstx.exc.WstxIOException: null
>          at
> com.ctc.wstx.sw.BaseStreamWriter.flush(BaseStreamWriter.java:255)
>          at
> org.apache.cxf.jaxrs.interceptor.JAXRSDefaultFaultOutInterceptor.hand
> leMessage(JAXRSDefaultFaultOutInterceptor.java:100)
>          ... 26 more
>
> *Scenario -3*
>
> Tika url : http://localhost:9998/tika
>
> *
> *
>
> *ERROR:*
> ****
> Feb 29, 2016 6:05:55 PM org.apache.tika.server.resource.TikaResource
> logRequest
> INFO: tika (autodetecting type)
> Feb 29, 2016 6:07:35 PM org.apache.tika.server.resource.TikaResource parse
> WARNING: tika: Text extraction failed
> org.apache.tika.exception.TikaException: Zip bomb detected!
>          at
> org.apache.tika.sax.SecureContentHandler.throwIfCauseOf(SecureContent
> Handler.java:192)
>          at
> org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:1
> 23)
>          at
> org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:1
> 20)
>          ... 31 more
>
> Feb 29, 2016 6:07:35 PM org.apache.cxf.jaxrs.utils.JAXRSUtils
> logMessageHandlerP
> roblem
> SEVERE: Problem with writing the data, class
> org.apache.tika.server.resource.Tik
> aResource$4, ContentType: text/plain
> Feb 29, 2016 6:07:35 PM org.apache.cxf.phase.PhaseInterceptorChain
> doDefaultLogg
> ing
> WARNING: Interceptor for
> {http://resource.server.tika.apache.org/}MetadataResour
> ce has thrown exception, unwinding now
> org.apache.cxf.interceptor.Fault: Could not send Message.
>          at
> org.apache.cxf.interceptor.MessageSenderInterceptor$MessageSenderEndi
> ngInterceptor.handleMessage(MessageSenderInterceptor.java:64)
>          ... 31 more
>
> Feb 29, 2016 6:07:35 PM org.apache.cxf.phase.PhaseInterceptorChain
> doDefaultLogg
> ing
> WARNING: Interceptor for
> {http://resource.server.tika.apache.org/}MetadataResour
> ce has thrown exception, unwinding now
> org.apache.cxf.interceptor.Fault: XML_WRITE_EXC
>          at
> org.apache.cxf.jaxrs.interceptor.JAXRSDefaultFaultOutInterceptor.hand
> leMessage(JAXRSDefaultFaultOutInterceptor.java:102)
>          at
> org.apache.cxf.phase.PhaseInterceptorChain.doIntercept(PhaseIntercept
> orChain.java:307)
> ....
>
>
> i was able to extract the content using 80 MB document.
>
> If i split the large file in to chunks and pass it to Tika  giving me
> exceptions.
>
> i am building the solution  in .NET
>
> Regards,
> Raghu.
>
>
> ------------------------------------------------------------------------
> *From:* Ken Krugler <kk...@transpac.com>
> *Sent:* Saturday, February 27, 2016 6:22 AM
> *To:* user@tika.apache.org
> *Subject:* RE: Unable to extract content from chunked portion of large file
> Hi Raghu,
>
> Previously you'd said
>
> "sending very large files to Tika will cause out of memory exception"
>
> and
>
> "sending that large file to Tika will causing timeout issues"
>
> I assume these are two different issues, as the second one seems related
> to how you're connecting to the Tika server via HTTP, correct?
>
> For out of memory issues, I'd suggested creating an input stream that
> can read from a chunked file *stored on disk*, thus alleviating at least
> part of the memory usage constraint. If the problem is that the
> resulting extracted text is also too big for memory, and you need to
> send it as a single document to Elasticsearch, then that's a separate
> (non-Tika) issue.
>
> For the timeout when sending the file to the Tika server, Sergey has
> already mentioned that you should be able to send it
> as multipart/form-data. And that will construct a temp file on disk from
> the chunks, and (I assume) stream it to Tika, so that also would take
> care of the same memory issue on the input side.
>
> Given the above, it seems like you've got enough ideas to try to solve
> this issue, yes?
>
> Regards,
>
> -- Ken
>
>> ------------------------------------------------------------------------
>>
>> *From:* raghu vittal
>>
>> *Sent:* February 24, 2016 10:50:29pm PST
>>
>> *To:* user@tika.apache.org <ma...@tika.apache.org>
>>
>> *Subject:* Re: Unable to extract content from chunked portion of large
>> file
>>
>>
>> Hi Ken,
>>
>> Thanks for the reply.
>> i understood your point.
>>
>> what i have tried.
>>
>> >  byte[] srcBytes = File.ReadAllBytes(filePath);
>>
>> > get the chunk  of 1 MB out of  srcBytes
>>
>> > when i pass this 1 MB chunk to Tika it is giving me the error.
>>
>> > As the WIKI Tika needs the entire file to extract content.
>>
>> this is where i struck. i don't wan't to pass entire file to Tika.
>>
>> correct me if i am wrong.
>>
>> --Raghu.
>>
>> ------------------------------------------------------------------------
>> *From:*Ken Krugler <kkrugler_lists@transpac.com
>> <ma...@transpac.com>>
>> *Sent:*Wednesday, February 24, 2016 9:07 PM
>> *To:*user@tika.apache.org <ma...@tika.apache.org>
>> *Subject:*RE: Unable to extract content from chunked portion of large
>> file
>> Hi Raghu,
>>
>> I don't think you understood what I was proposing.
>>
>> I suggested creating a service that could receive chunks of the file
>> (persisted to local disk). Then this service could implement an input
>> stream class that would read sequentially from these pieces. This
>> input stream would be passed to Tika, thus giving Tika a single
>> continuous stream of data to the entire file content.
>>
>> -- Ken
>>
>>> ------------------------------------------------------------------------
>>> *From:*raghu vittal
>>> *Sent:*February 24, 2016 4:32:01am PST
>>> *To:*user@tika.apache.org <ma...@tika.apache.org>
>>> *Subject:*Re: Unable to extract content from chunked portion of large
>>> file
>>>
>>> Thanks for your reply.
>>>
>>> In our application user can upload large files. Our intention is to
>>> extract the content out of large file and dump that in Elastic for
>>> contented based search.
>>> we have > 300 MB size .xlsx and .doc files. sending that large file
>>> to Tika will causing timeout issues.
>>>
>>> i tried getting chunk of file and pass to Tika. Tika given me invalid
>>> data exception.
>>>
>>> I Think for Tika we need to pass entire file at once to extract content.
>>>
>>> Raghu.
>>>
>>> ------------------------------------------------------------------------
>>> *From:*Ken Krugler <kkrugler_lists@transpac.com
>>> <ma...@transpac.com>>
>>> *Sent:*Friday, February 19, 2016 8:22 PM
>>> *To:*user@tika.apache.org <ma...@tika.apache.org>
>>> *Subject:*RE: Unable to extract content from chunked portion of large
>>> file
>>> One option is to create your own RESTful API that lets you send
>>> chunks of the file, and then you can provide an input stream that
>>> provides the seamless data view of the chunks to Tika (which is what
>>> it needs).
>>>
>>> -- Ken
>>>
>>>> ------------------------------------------------------------------------
>>>> *From:*raghu vittal
>>>> *Sent:*February 19, 2016 1:37:49am PST
>>>> *To:*user@tika.apache.org <ma...@tika.apache.org>
>>>> *Subject:*Unable to extract content from chunked portion of large file
>>>>
>>>> Hi All
>>>>
>>>> We have very large PDF, .docx, and .xlsx files. We are using Tika to
>>>> extract content and dump the data into Elasticsearch for full-text
>>>> search. Sending very large files to Tika causes an out-of-memory
>>>> exception.
>>>>
>>>> We want to chunk the file and send it to Tika for content extraction,
>>>> but when we passed a chunked portion of a file to Tika it returned
>>>> empty text. I assume Tika relies on the file structure, which is why
>>>> it is not returning any content.
>>>>
>>>> We are using the Tika Server (REST API) from our .NET application.
>>>>
>>>> Please suggest a better approach for this scenario.
>>>>
>>>> Regards,
>>>> Raghu.
>
> --------------------------
> Ken Krugler
> +1 530-210-6378
> http://www.scaleunlimited.com
> custom big data solutions & training
> Hadoop, Cascading, Cassandra & Solr
>
>
>
>
>


-- 
Sergey Beryozkin

Talend Community Coders
http://coders.talend.com/

Re: Unable to extract content from chunked portion of large file

Posted by raghu vittal <rr...@live.com>.
Hi Ken,


These are my observations:


Scenario 1


Tika URL: http://localhost:9998/tika


I have tried the multipart/form-data approach suggested by Sergey. I am getting the error below (we are using the Tika 1.11 server).

var data = File.ReadAllBytes(filename);  // loads the whole file into memory
using (var client = new HttpClient())
{
    using (var content = new MultipartFormDataContent())
    {
        var byteArrayContent = new ByteArrayContent(data);
        byteArrayContent.Headers.Add("Content-Type", "application/octet-stream");
        content.Add(byteArrayContent);
        // PUT of the multipart envelope to /tika
        var str = client.PutAsync(tikaServerUrl, content).Result.Content.ReadAsStringAsync().Result;
    }
}


ERROR:

Feb 29, 2016 5:26:01 PM org.apache.tika.server.resource.TikaResource logRequest
INFO: tika (multipart/form-data;boundary="03cc158f-3213-439f-a0be-3aba14c7036b")

Feb 29, 2016 5:26:01 PM org.apache.tika.server.resource.TikaResource parse
WARNING: tika: Text extraction failed
org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.server.resource.TikaResource$1@36b1a1ec
        at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:282)
        at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
        ................................................
        at java.lang.Thread.run(Thread.java:745)
Caused by: javax.ws.rs.WebApplicationException: HTTP 415 Unsupported Media Type
        at org.apache.tika.server.resource.TikaResource$1.parse(TikaResource.java:116)
        at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
        ... 32 more

Feb 29, 2016 5:26:01 PM org.apache.cxf.jaxrs.utils.JAXRSUtils logMessageHandlerProblem
SEVERE: Problem with writing the data, class org.apache.tika.server.resource.TikaResource$4, ContentType: text/plain

I think the Tika server does not support this multipart request.
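
As an aside on the 415: below is a minimal sketch of the two request shapes that should work against a stock tika-server 1.11, assuming the documented endpoints (PUT /tika takes the raw document bytes as the body; the multipart route is POST /tika/form). Sending a multipart envelope with PUT to /tika gets auto-detected as multipart/form-data, for which there is no parser, which would explain the 415. The URL and variable names are illustrative only.

// Sketch only: endpoint paths are from the tika-server docs; names are made up.
using System;
using System.IO;
using System.Net.Http;

class TikaRequestSketch
{
    static void Main(string[] args)
    {
        string filename = args[0];
        using (var client = new HttpClient())
        {
            // Option A: stream the raw file as the PUT body to /tika.
            // No File.ReadAllBytes, so the file is never held in memory whole.
            using (var fileStream = File.OpenRead(filename))
            {
                string text = client.PutAsync("http://localhost:9998/tika",
                                              new StreamContent(fileStream))
                                    .Result.Content.ReadAsStringAsync().Result;
                Console.WriteLine(text.Length);
            }

            // Option B: multipart/form-data, but POSTed to /tika/form
            // rather than PUT to /tika.
            using (var fileStream = File.OpenRead(filename))
            using (var content = new MultipartFormDataContent())
            {
                content.Add(new StreamContent(fileStream), "file",
                            Path.GetFileName(filename));
                string text = client.PostAsync("http://localhost:9998/tika/form", content)
                                    .Result.Content.ReadAsStringAsync().Result;
                Console.WriteLine(text.Length);
            }
        }
    }
}

Option A also sidesteps the File.ReadAllBytes allocation, which matters for the 240 MB case.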



Passing a 240 MB file to Tika for content extraction gives me the following errors.

Scenario 2


Tika URL: http://localhost:9998/unpack/all


Rather than ReadAsStringAsync(), I used ReadAsStreamAsync() and captured the output stream into a ZipArchive.


ERROR:

Feb 29, 2016 6:03:26 PM org.apache.cxf.jaxrs.utils.JAXRSUtils logMessageHandlerProblem
SEVERE: Problem with writing the data, class java.util.HashMap, ContentType: application/zip
Feb 29, 2016 6:03:26 PM org.apache.cxf.phase.PhaseInterceptorChain doDefaultLogging
WARNING: Interceptor for {http://resource.server.tika.apache.org/}MetadataResource has thrown exception, unwinding now
org.apache.cxf.interceptor.Fault
        at org.apache.cxf.jaxrs.interceptor.JAXRSOutInterceptor.handleWriteException(JAXRSOutInterceptor.java:363)

        ... 41 more

Feb 29, 2016 6:03:28 PM org.apache.cxf.phase.PhaseInterceptorChain doDefaultLogging
WARNING: Interceptor for {http://resource.server.tika.apache.org/}MetadataResource has thrown exception, unwinding now
org.apache.cxf.interceptor.Fault: XML_WRITE_EXC
        at org.apache.cxf.jaxrs.interceptor.JAXRSDefaultFaultOutInterceptor.handleMessage(JAXRSDefaultFaultOutInterceptor.java:102)

Caused by: com.ctc.wstx.exc.WstxIOException: null
        at com.ctc.wstx.sw.BaseStreamWriter.flush(BaseStreamWriter.java:255)
        at org.apache.cxf.jaxrs.interceptor.JAXRSDefaultFaultOutInterceptor.handleMessage(JAXRSDefaultFaultOutInterceptor.java:100)
        ... 26 more
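
One caveat on the ZipArchive step: the HTTP response stream is not seekable, and System.IO.Compression buffers a non-seekable stream entirely into memory when opening it in read mode, which re-creates the memory problem. A sketch that spools the /unpack/all response to a temp file first, assuming the 1.11 endpoint and illustrative names:

// Sketch only: spool the /unpack/all zip response to disk, then open it.
using System;
using System.IO;
using System.IO.Compression;
using System.Net.Http;

class UnpackSketch
{
    static void Main(string[] args)
    {
        string filename = args[0];
        string tmpZip = Path.GetTempFileName();
        using (var client = new HttpClient())
        using (var fileStream = File.OpenRead(filename))
        {
            var request = new HttpRequestMessage(HttpMethod.Put,
                "http://localhost:9998/unpack/all");
            request.Content = new StreamContent(fileStream);
            request.Headers.Accept.ParseAdd("application/zip");

            using (var response = client.SendAsync(request,
                       HttpCompletionOption.ResponseHeadersRead).Result)
            using (var body = response.Content.ReadAsStreamAsync().Result)
            using (var spool = File.Create(tmpZip))
            {
                body.CopyTo(spool);  // stream to disk, never into memory whole
            }
        }

        using (var archive = ZipFile.OpenRead(tmpZip))
        {
            // Entries hold the extracted text, metadata, and any embedded docs.
            foreach (var entry in archive.Entries)
                Console.WriteLine(entry.FullName);
        }
        File.Delete(tmpZip);
    }
}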


Scenario 3

Tika URL: http://localhost:9998/tika


ERROR:
Feb 29, 2016 6:05:55 PM org.apache.tika.server.resource.TikaResource logRequest
INFO: tika (autodetecting type)
Feb 29, 2016 6:07:35 PM org.apache.tika.server.resource.TikaResource parse
WARNING: tika: Text extraction failed
org.apache.tika.exception.TikaException: Zip bomb detected!
        at org.apache.tika.sax.SecureContentHandler.throwIfCauseOf(SecureContentHandler.java:192)
        at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:123)
        at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
        ... 31 more

Feb 29, 2016 6:07:35 PM org.apache.cxf.jaxrs.utils.JAXRSUtils logMessageHandlerProblem
SEVERE: Problem with writing the data, class org.apache.tika.server.resource.TikaResource$4, ContentType: text/plain
Feb 29, 2016 6:07:35 PM org.apache.cxf.phase.PhaseInterceptorChain doDefaultLogging
WARNING: Interceptor for {http://resource.server.tika.apache.org/}MetadataResource has thrown exception, unwinding now
org.apache.cxf.interceptor.Fault: Could not send Message.
        at org.apache.cxf.interceptor.MessageSenderInterceptor$MessageSenderEndingInterceptor.handleMessage(MessageSenderInterceptor.java:64)
        ... 31 more

Feb 29, 2016 6:07:35 PM org.apache.cxf.phase.PhaseInterceptorChain doDefaultLogging
WARNING: Interceptor for {http://resource.server.tika.apache.org/}MetadataResource has thrown exception, unwinding now
org.apache.cxf.interceptor.Fault: XML_WRITE_EXC
        at org.apache.cxf.jaxrs.interceptor.JAXRSDefaultFaultOutInterceptor.handleMessage(JAXRSDefaultFaultOutInterceptor.java:102)
        at org.apache.cxf.phase.PhaseInterceptorChain.doIntercept(PhaseInterceptorChain.java:307)
....


I was able to extract the content of an 80 MB document.

If I split the large file into chunks and pass them to Tika, I get exceptions.

I am building the solution in .NET.

Regards,
Raghu.


________________________________
From: Ken Krugler <kk...@transpac.com>
Sent: Saturday, February 27, 2016 6:22 AM
To: user@tika.apache.org
Subject: RE: Unable to extract content from chunked portion of large file

Hi Raghu,

Previously you'd said

"sending very large files to Tika will cause out of memory exception"

and

"sending that large file to Tika will causing timeout issues"

I assume these are two different issues, as the second one seems related to how you're connecting to the Tika server via HTTP, correct?

For out of memory issues, I'd suggested creating an input stream that can read from a chunked file *stored on disk*, thus alleviating at least part of the memory usage constraint. If the problem is that the resulting extracted text is also too big for memory, and you need to send it as a single document to Elasticsearch, then that's a separate (non-Tika) issue.

For the timeout when sending the file to the Tika server, Sergey has already mentioned that you should be able to send it as multipart/form-data. And that will construct a temp file on disk from the chunks, and (I assume) stream it to Tika, so that also would take care of the same memory issue on the input side.

Given the above, it seems like you've got enough ideas to try to solve this issue, yes?

Regards,

-- Ken

________________________________

From: raghu vittal

Sent: February 24, 2016 10:50:29pm PST

To: user@tika.apache.org<ma...@tika.apache.org>

Subject: Re: Unable to extract content from chunked portion of large file

Hi Ken,

Thanks for the reply.
I understood your point.

What I have tried:

>  byte[] srcBytes = File.ReadAllBytes(filePath);

> get a chunk of 1 MB out of srcBytes

> when I pass this 1 MB chunk to Tika, it gives me an error.

> As per the wiki, Tika needs the entire file to extract content.

This is where I am stuck. I don't want to pass the entire file to Tika.

Correct me if I am wrong.

--Raghu.

________________________________
From: Ken Krugler <kk...@transpac.com>>
Sent: Wednesday, February 24, 2016 9:07 PM
To: user@tika.apache.org<ma...@tika.apache.org>
Subject: RE: Unable to extract content from chunked portion of large file

Hi Raghu,

I don't think you understood what I was proposing.

I suggested creating a service that could receive chunks of the file (persisted to local disk). Then this service could implement an input stream class that would read sequentially from these pieces. This input stream would be passed to Tika, thus giving Tika a single continuous stream of data to the entire file content.

-- Ken

________________________________
From: raghu vittal
Sent: February 24, 2016 4:32:01am PST
To: user@tika.apache.org<ma...@tika.apache.org>
Subject: Re: Unable to extract content from chunked portion of large file

Thanks for your reply.

In our application users can upload large files. Our intention is to extract the content from a large file and dump it into Elastic for content-based search.
We have .xlsx and .doc files larger than 300 MB; sending such a large file to Tika causes timeout issues.

I tried getting a chunk of the file and passing it to Tika; Tika gave me an invalid-data exception.

I think Tika needs the entire file at once to extract content.

Raghu.

________________________________
From: Ken Krugler <kk...@transpac.com>>
Sent: Friday, February 19, 2016 8:22 PM
To: user@tika.apache.org<ma...@tika.apache.org>
Subject: RE: Unable to extract content from chunked portion of large file

One option is to create your own RESTful API that lets you send chunks of the file, and then you can provide an input stream that provides the seamless data view of the chunks to Tika (which is what it needs).

-- Ken

________________________________
From: raghu vittal
Sent: February 19, 2016 1:37:49am PST
To: user@tika.apache.org<ma...@tika.apache.org>
Subject: Unable to extract content from chunked portion of large file


Hi All

We have very large PDF, .docx, and .xlsx files. We are using Tika to extract content and dump the data into Elasticsearch for full-text search.
Sending very large files to Tika causes an out-of-memory exception.

We want to chunk the file and send it to Tika for content extraction, but when we passed a chunked portion of a file to Tika it returned empty text.
I assume Tika relies on the file structure, which is why it is not returning any content.

We are using the Tika Server (REST API) from our .NET application.

Please suggest a better approach for this scenario.

Regards,
Raghu.

--------------------------
Ken Krugler
+1 530-210-6378
http://www.scaleunlimited.com
custom big data solutions & training
Hadoop, Cascading, Cassandra & Solr






RE: Unable to extract content from chunked portion of large file

Posted by Ken Krugler <kk...@transpac.com>.
Hi Raghu,

Previously you'd said

"sending very large files to Tika will cause out of memory exception"

and

"sending that large file to Tika will causing timeout issues"

I assume these are two different issues, as the second one seems related to how you're connecting to the Tika server via HTTP, correct?

For out of memory issues, I'd suggested creating an input stream that can read from a chunked file *stored on disk*, thus alleviating at least part of the memory usage constraint. If the problem is that the resulting extracted text is also too big for memory, and you need to send it as a single document to Elasticsearch, then that's a separate (non-Tika) issue.

For the timeout when sending the file to the Tika server, Sergey has already mentioned that you should be able to send it as multipart/form-data. And that will construct a temp file on disk from the chunks, and (I assume) stream it to Tika, so that also would take care of the same memory issue on the input side.

Given the above, it seems like you've got enough ideas to try to solve this issue, yes?

Regards,

-- Ken

> From: raghu vittal
> Sent: February 24, 2016 10:50:29pm PST
> To: user@tika.apache.org
> Subject: Re: Unable to extract content from chunked portion of large file
> 
> Hi Ken,
> 
> Thanks for the reply.
> I understood your point.
> 
> What I have tried:
> 
> >  byte[] srcBytes = File.ReadAllBytes(filePath);
> 
> > get a chunk of 1 MB out of srcBytes
> 
> > when I pass this 1 MB chunk to Tika, it gives me an error.
> 
> > As per the wiki, Tika needs the entire file to extract content.
> 
> This is where I am stuck. I don't want to pass the entire file to Tika.
> 
> Correct me if I am wrong.
> 
> --Raghu.
> 
> From: Ken Krugler <kk...@transpac.com>
> Sent: Wednesday, February 24, 2016 9:07 PM
> To: user@tika.apache.org
> Subject: RE: Unable to extract content from chunked portion of large file
>  
> Hi Raghu,
> 
> I don't think you understood what I was proposing.
> 
> I suggested creating a service that could receive chunks of the file (persisted to local disk). Then this service could implement an input stream class that would read sequentially from these pieces. This input stream would be passed to Tika, thus giving Tika a single continuous stream of data to the entire file content.
> 
> -- Ken
> 
>> From: raghu vittal
>> Sent: February 24, 2016 4:32:01am PST
>> To: user@tika.apache.org
>> Subject: Re: Unable to extract content from chunked portion of large file
>> 
>> Thanks for your reply.
>> 
>> In our application users can upload large files. Our intention is to extract the content from a large file and dump it into Elastic for content-based search.
>> We have .xlsx and .doc files larger than 300 MB; sending such a large file to Tika causes timeout issues.
>> 
>> I tried getting a chunk of the file and passing it to Tika; Tika gave me an invalid-data exception.
>> 
>> I think Tika needs the entire file at once to extract content.
>> 
>> Raghu.
>> 
>> From: Ken Krugler <kk...@transpac.com>
>> Sent: Friday, February 19, 2016 8:22 PM
>> To: user@tika.apache.org
>> Subject: RE: Unable to extract content from chunked portion of large file
>>  
>> One option is to create your own RESTful API that lets you send chunks of the file, and then you can provide an input stream that provides the seamless data view of the chunks to Tika (which is what it needs).
>> 
>> -- Ken
>> 
>>> From: raghu vittal
>>> Sent: February 19, 2016 1:37:49am PST
>>> To: user@tika.apache.org
>>> Subject: Unable to extract content from chunked portion of large file
>>> 
>>> Hi All
>>> 
>>> We have very large PDF, .docx, and .xlsx files. We are using Tika to extract content and dump the data into Elasticsearch for full-text search.
>>> Sending very large files to Tika causes an out-of-memory exception.
>>> 
>>> We want to chunk the file and send it to Tika for content extraction, but when we passed a chunked portion of a file to Tika it returned empty text.
>>> I assume Tika relies on the file structure, which is why it is not returning any content.
>>> 
>>> We are using the Tika Server (REST API) from our .NET application.
>>> 
>>> Please suggest a better approach for this scenario.
>>> 
>>> Regards,
>>> Raghu.

--------------------------
Ken Krugler
+1 530-210-6378
http://www.scaleunlimited.com
custom big data solutions & training
Hadoop, Cascading, Cassandra & Solr






Re: Unable to extract content from chunked portion of large file

Posted by raghu vittal <rr...@live.com>.
Hi Ken,


Thanks for the reply.

I understood your point.

What I have tried:

>  byte[] srcBytes = File.ReadAllBytes(filePath);

> get a chunk of 1 MB out of srcBytes

> when I pass this 1 MB chunk to Tika, it gives me an error.

> As per the wiki, Tika needs the entire file to extract content.

This is where I am stuck. I don't want to pass the entire file to Tika.

Correct me if I am wrong.

--Raghu.

________________________________
From: Ken Krugler <kk...@transpac.com>
Sent: Wednesday, February 24, 2016 9:07 PM
To: user@tika.apache.org
Subject: RE: Unable to extract content from chunked portion of large file

Hi Raghu,

I don't think you understood what I was proposing.

I suggested creating a service that could receive chunks of the file (persisted to local disk). Then this service could implement an input stream class that would read sequentially from these pieces. This input stream would be passed to Tika, thus giving Tika a single continuous stream of data to the entire file content.

-- Ken

________________________________

From: raghu vittal

Sent: February 24, 2016 4:32:01am PST

To: user@tika.apache.org<ma...@tika.apache.org>

Subject: Re: Unable to extract content from chunked portion of large file

Thanks for your reply.

In our application users can upload large files. Our intention is to extract the content from a large file and dump it into Elastic for content-based search.
We have .xlsx and .doc files larger than 300 MB; sending such a large file to Tika causes timeout issues.

I tried getting a chunk of the file and passing it to Tika; Tika gave me an invalid-data exception.

I think Tika needs the entire file at once to extract content.

Raghu.

________________________________
From: Ken Krugler <kk...@transpac.com>>
Sent: Friday, February 19, 2016 8:22 PM
To: user@tika.apache.org<ma...@tika.apache.org>
Subject: RE: Unable to extract content from chunked portion of large file

One option is to create your own RESTful API that lets you send chunks of the file, and then you can provide an input stream that provides the seamless data view of the chunks to Tika (which is what it needs).

-- Ken

________________________________
From: raghu vittal
Sent: February 19, 2016 1:37:49am PST
To: user@tika.apache.org<ma...@tika.apache.org>
Subject: Unable to extract content from chunked portion of large file


Hi All

We have very large PDF, .docx, and .xlsx files. We are using Tika to extract content and dump the data into Elasticsearch for full-text search.
Sending very large files to Tika causes an out-of-memory exception.

We want to chunk the file and send it to Tika for content extraction, but when we passed a chunked portion of a file to Tika it returned empty text.
I assume Tika relies on the file structure, which is why it is not returning any content.

We are using the Tika Server (REST API) from our .NET application.

Please suggest a better approach for this scenario.

Regards,
Raghu.



--------------------------
Ken Krugler
+1 530-210-6378
http://www.scaleunlimited.com
custom big data solutions & training
Hadoop, Cascading, Cassandra & Solr






RE: Unable to extract content from chunked portion of large file

Posted by Ken Krugler <kk...@transpac.com>.
Hi Raghu,

I don't think you understood what I was proposing.

I suggested creating a service that could receive chunks of the file (persisted to local disk). Then this service could implement an input stream class that would read sequentially from these pieces. This input stream would be passed to Tika, thus giving Tika a single continuous stream of data to the entire file content.
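
A rough .NET sketch of that concatenating-stream idea, for illustration only. (The real stream would live in the JVM-side service next to Tika, where java.io.SequenceInputStream over the chunk files does exactly this, but the pattern is the same; the chunk file names here are hypothetical.)

// Sketch only: presents a set of on-disk chunk files as one sequential stream.
using System;
using System.IO;

class ChunkedFileStream : Stream
{
    private readonly string[] _chunkPaths;  // e.g. "big.xlsx.part0", "big.xlsx.part1", ...
    private int _index;
    private Stream _current;

    public ChunkedFileStream(string[] chunkPaths)
    {
        _chunkPaths = chunkPaths;
        _current = File.OpenRead(_chunkPaths[0]);
    }

    public override int Read(byte[] buffer, int offset, int count)
    {
        while (true)
        {
            int n = _current.Read(buffer, offset, count);
            if (n > 0)
                return n;                          // data from the current chunk
            _current.Dispose();
            if (++_index >= _chunkPaths.Length)
                return 0;                          // all chunks drained: end of stream
            _current = File.OpenRead(_chunkPaths[_index]);
        }
    }

    protected override void Dispose(bool disposing)
    {
        if (disposing) _current.Dispose();
        base.Dispose(disposing);
    }

    public override bool CanRead => true;
    public override bool CanSeek => false;
    public override bool CanWrite => false;
    public override long Length => throw new NotSupportedException();
    public override long Position
    {
        get => throw new NotSupportedException();
        set => throw new NotSupportedException();
    }
    public override void Flush() { }
    public override long Seek(long offset, SeekOrigin origin) => throw new NotSupportedException();
    public override void SetLength(long value) => throw new NotSupportedException();
    public override void Write(byte[] buffer, int offset, int count) => throw new NotSupportedException();
}

Wrapped in a StreamContent, this lets the reassembled file go out as a single request without ever being loaded whole.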

-- Ken

> From: raghu vittal
> Sent: February 24, 2016 4:32:01am PST
> To: user@tika.apache.org
> Subject: Re: Unable to extract content from chunked portion of large file
> 
> Thanks for your reply.
> 
> In our application users can upload large files. Our intention is to extract the content from a large file and dump it into Elastic for content-based search.
> We have .xlsx and .doc files larger than 300 MB; sending such a large file to Tika causes timeout issues.
> 
> I tried getting a chunk of the file and passing it to Tika; Tika gave me an invalid-data exception.
> 
> I think Tika needs the entire file at once to extract content.
> 
> Raghu.
> 
> From: Ken Krugler <kk...@transpac.com>
> Sent: Friday, February 19, 2016 8:22 PM
> To: user@tika.apache.org
> Subject: RE: Unable to extract content from chunked portion of large file
>  
> One option is to create your own RESTful API that lets you send chunks of the file, and then you can provide an input stream that provides the seamless data view of the chunks to Tika (which is what it needs).
> 
> -- Ken
> 
>> From: raghu vittal
>> Sent: February 19, 2016 1:37:49am PST
>> To: user@tika.apache.org
>> Subject: Unable to extract content from chunked portion of large file
>> 
>> Hi All
>> 
>> We have very large PDF, .docx, and .xlsx files. We are using Tika to extract content and dump the data into Elasticsearch for full-text search.
>> Sending very large files to Tika causes an out-of-memory exception.
>> 
>> We want to chunk the file and send it to Tika for content extraction, but when we passed a chunked portion of a file to Tika it returned empty text.
>> I assume Tika relies on the file structure, which is why it is not returning any content.
>> 
>> We are using the Tika Server (REST API) from our .NET application.
>> 
>> Please suggest a better approach for this scenario.
>> 
>> Regards,
>> Raghu.



--------------------------
Ken Krugler
+1 530-210-6378
http://www.scaleunlimited.com
custom big data solutions & training
Hadoop, Cascading, Cassandra & Solr






Re: Unable to extract content from chunked portion of large file

Posted by raghu vittal <rr...@live.com>.
Thanks for your reply.


In our application users can upload large files. Our intention is to extract the content from a large file and dump it into Elastic for content-based search.

We have .xlsx and .doc files larger than 300 MB; sending such a large file to Tika causes timeout issues.

I tried getting a chunk of the file and passing it to Tika; Tika gave me an invalid-data exception.

I think Tika needs the entire file at once to extract content.


Raghu.

________________________________
From: Ken Krugler <kk...@transpac.com>
Sent: Friday, February 19, 2016 8:22 PM
To: user@tika.apache.org
Subject: RE: Unable to extract content from chunked portion of large file

One option is to create your own RESTful API that lets you send chunks of the file, and then you can provide an input stream that provides the seamless data view of the chunks to Tika (which is what it needs).

-- Ken

________________________________

From: raghu vittal

Sent: February 19, 2016 1:37:49am PST

To: user@tika.apache.org<ma...@tika.apache.org>

Subject: Unable to extract content from chunked portion of large file


Hi All

We have very large PDF, .docx, and .xlsx files. We are using Tika to extract content and dump the data into Elasticsearch for full-text search.
Sending very large files to Tika causes an out-of-memory exception.

We want to chunk the file and send it to Tika for content extraction, but when we passed a chunked portion of a file to Tika it returned empty text.
I assume Tika relies on the file structure, which is why it is not returning any content.

We are using the Tika Server (REST API) from our .NET application.

Please suggest a better approach for this scenario.

Regards,
Raghu.



--------------------------
Ken Krugler
+1 530-210-6378
http://www.scaleunlimited.com
custom big data solutions & training
Hadoop, Cascading, Cassandra & Solr







--------------------------
Ken Krugler
+1 530-210-6378
http://www.scaleunlimited.com
custom big data solutions & training
Hadoop, Cascading, Cassandra & Solr






RE: Unable to extract content from chunked portion of large file

Posted by Ken Krugler <kk...@transpac.com>.
One option is to create your own RESTful API that lets you send chunks of the file, and then you can provide an input stream that provides the seamless data view of the chunks to Tika (which is what it needs).

-- Ken

> From: raghu vittal
> Sent: February 19, 2016 1:37:49am PST
> To: user@tika.apache.org
> Subject: Unable to extract content from chunked portion of large file
> 
> Hi All
> 
> We have very large PDF, .docx, and .xlsx files. We are using Tika to extract content and dump the data into Elasticsearch for full-text search.
> Sending very large files to Tika causes an out-of-memory exception.
> 
> We want to chunk the file and send it to Tika for content extraction, but when we passed a chunked portion of a file to Tika it returned empty text.
> I assume Tika relies on the file structure, which is why it is not returning any content.
> 
> We are using the Tika Server (REST API) from our .NET application.
> 
> Please suggest a better approach for this scenario.
> 
> Regards,
> Raghu.
> 
> 

--------------------------
Ken Krugler
+1 530-210-6378
http://www.scaleunlimited.com
custom big data solutions & training
Hadoop, Cascading, Cassandra & Solr







--------------------------
Ken Krugler
+1 530-210-6378
http://www.scaleunlimited.com
custom big data solutions & training
Hadoop, Cascading, Cassandra & Solr