You are viewing a plain text version of this content. The canonical link for it is here.

Posted to user@tika.apache.org by raghu vittal <rr...@live.com> on 2016/02/19 10:37:49 UTC

Unable to extract content from chunked portion of large file

Hi All

we have very large PDF,.docx,.xlsx. We are using Tika to extract content and dump data in Elastic Search for full-text search.
sending very large files to Tika will cause out of memory exception.

we want to chunk the file and send it to TIKA for content extraction. when we passed chunked portion of file to Tika it is giving empty text.
I assume Tika is relied on file structure that why it is not giving any content.

we are using Tika Server(REST api) in our .net application.

please suggest us better approach for this scenario.

Regards,
Raghu.

RE: Unable to extract content from chunked portion of large file

Posted by Ken Krugler <kk...@transpac.com>.

Hi Sergey,

Thanks for digging into the code - I'd seen the docs and assumed it wouldn't work.

Anybody have a chance to give that a try? Maybe Raghu? :)

-- Ken

> From: Sergey Beryozkin
> Sent: February 24, 2016 7:44:13am PST
> To: user@tika.apache.org
> Subject: Re: Unable to extract content from chunked portion of large file
> 
> Hi All
> 
> If a large file is passed to a Tika server as a multipart/form payload
> 
> then CXF will be creating a temp file on the disk itself.
> 
> Hmm... I was looking for a reference to it and I found the advice not to
> use multipart/form-data:
> https://wiki.apache.org/tika/TikaJAXRS (in Services)
> 
> I believe it should be removed, see:
> 
> http://svn.apache.org/repos/asf/tika/trunk/tika-server/src/main/java/org/apache/tika/server/resource/TikaResource.java, example:
> 
> @POST
>    @Consumes("multipart/form-data")
>    @Produces("text/plain")
>    @Path("form")
>    public StreamingOutput getTextFromMultipart(Attachment att, @Context final UriInfo info) {
>        return produceText(att.getObject(InputStream.class), att.getHeaders(), info);
>    }
> 
> 
> Cheers, Sergey
> 
> 
> 
> On 24/02/16 15:37, Ken Krugler wrote:
>> Hi Raghu,
>> 
>> I don't think you understood what I was proposing.
>> 
>> I suggested creating a service that could receive chunks of the file
>> (persisted to local disk). Then this service could implement an input
>> stream class that would read sequentially from these pieces. This input
>> stream would be passed to Tika, thus giving Tika a single continuous
>> stream of data to the entire file content.
>> 
>> -- Ken
>> 
>>> ------------------------------------------------------------------------
>>> 
>>> *From:* raghu vittal
>>> 
>>> *Sent:* February 24, 2016 4:32:01am PST
>>> 
>>> *To:* user@tika.apache.org <ma...@tika.apache.org>
>>> 
>>> *Subject:* Re: Unable to extract content from chunked portion of large
>>> file
>>> 
>>> 
>>> Thanks for your reply.
>>> 
>>> In our application user can upload large files. Our intention is to
>>> extract the content out of large file and dump that in Elastic for
>>> contented based search.
>>> we have > 300 MB size .xlsx and .doc files. sending that large file to
>>> Tika will causing timeout issues.
>>> 
>>> i tried getting chunk of file and pass to Tika. Tika given me invalid
>>> data exception.
>>> 
>>> I Think for Tika we need to pass entire file at once to extract content.
>>> 
>>> Raghu.
>>> 
>>> ------------------------------------------------------------------------
>>> *From:*Ken Krugler <kkrugler_lists@transpac.com
>>> <ma...@transpac.com>>
>>> *Sent:*Friday, February 19, 2016 8:22 PM
>>> *To:*user@tika.apache.org <ma...@tika.apache.org>
>>> *Subject:*RE: Unable to extract content from chunked portion of large file
>>> One option is to create your own RESTful API that lets you send chunks
>>> of the file, and then you can provide an input stream that provides
>>> the seamless data view of the chunks to Tika (which is what it needs).
>>> 
>>> -- Ken
>>> 
>>>> ------------------------------------------------------------------------
>>>> *From:*raghu vittal
>>>> *Sent:*February 19, 2016 1:37:49am PST
>>>> *To:*user@tika.apache.org <ma...@tika.apache.org>
>>>> *Subject:*Unable to extract content from chunked portion of large file
>>>> 
>>>> Hi All
>>>> 
>>>> we have very large PDF,.docx,.xlsx. We are using Tika to extract
>>>> content and dump data in Elastic Search for full-text search.
>>>> sending very large files to Tika will cause out of memory exception.
>>>> 
>>>> we want to chunk the file and send it to TIKA for content extraction.
>>>> when we passed chunked portion of file to Tika it is giving empty text.
>>>> I assume Tika is relied on file structure that why it is not giving
>>>> any content.
>>>> 
>>>> we are using Tika Server(REST api) in our .net application.
>>>> 
>>>> please suggest us better approach for this scenario.
>>>> 
>>>> Regards,
>>>> Raghu.





--------------------------
Ken Krugler
+1 530-210-6378
http://www.scaleunlimited.com
custom big data solutions & training
Hadoop, Cascading, Cassandra & Solr