Posted to dev@community.apache.org by 朱桂锋 <pp...@gmail.com> on 2023/03/17 14:02:23 UTC

tika-offheap-leak

First of all, thank you for the Tika project; it is a great project!

Recently I have been running Tika to extract text from documents, and I
found that the Java off-heap memory keeps growing until memory usage
reaches 100%, at which point the process is killed by the OOM killer.
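
For readers who want to confirm that the growth is outside the heap, here is a minimal sketch using only the standard java.lang.management API. The class name OffHeapSnapshot and the map keys are made up for this illustration. Note the limitation: memory that native libraries allocate directly (for example via malloc) is not visible to these beans, so OS-level tools such as pmap are still needed for that part.

```java
import java.lang.management.BufferPoolMXBean;
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryMXBean;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of Tika: collects the memory figures the JVM itself can report.
public class OffHeapSnapshot {

    static Map<String, Long> snapshot() {
        Map<String, Long> values = new LinkedHashMap<>();
        MemoryMXBean memory = ManagementFactory.getMemoryMXBean();
        // Java heap and JVM-managed non-heap areas (metaspace, code cache, ...).
        values.put("heap.used", memory.getHeapMemoryUsage().getUsed());
        values.put("nonheap.used", memory.getNonHeapMemoryUsage().getUsed());
        // NIO buffer pools (direct and memory-mapped buffers); may report -1 if unknown.
        for (BufferPoolMXBean pool : ManagementFactory.getPlatformMXBeans(BufferPoolMXBean.class)) {
            values.put("buffer." + pool.getName() + ".used", pool.getMemoryUsed());
        }
        return values;
    }

    public static void main(String[] args) {
        snapshot().forEach((name, bytes) -> System.out.println(name + " = " + bytes));
    }
}
```

If these numbers stay flat while the resident size of the process keeps growing, the growth is probably in memory the JVM does not account for here.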

I then used pmap and dumped the memory regions outside the Java heap.
Their contents look like this:

[Content_Types].xmlPK
_rels/.relsPK word/_rels/document.xml.relsPK word/document.xmlPK
word/footer4.xmlPK word/header4.xmlPK word/footer2.xmlPK word/header2.xmlPK
word/header3.xmlPK word/footer3.xmlPK word/header1.xmlPK
word/footer1.xmlPK
word/footnotes.xmlPK word/endnotes.xmlPK word/header5.xmlPK
word/media/image3.pngPK word/media/image1.jpegPK word/media/image2.jpegPK
word/theme/theme1.xmlPK word/settings.xmlPK
customXml/itemProps2.xmlPK
customXml/item2.xmlPK docProps/custom.xmlPK t?92
customXml/_rels/item1.xml.relsPK customXml/_rels/item2.xml.relsPK
customXml/itemProps1.xmlPK



These look like contents of Office documents. Why are they in off-heap
memory? I suspect that parsing certain Office documents causes a memory leak.
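
As a side note on what the dump contains: the names resemble the part names of an OOXML (.docx) package, which is a ZIP file, and every ZIP record header starts with the bytes "PK". In the central directory of a ZIP, each entry name is followed by the next record header when the entry has no extra field or comment, which would produce the "name + PK" pattern seen above. The sketch below builds a tiny ZIP in memory with java.util.zip to show the pattern. The class name ZipDirectorySketch and the entry contents are invented for illustration, and this says nothing about which component holds the memory.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.ZipEntry;
import java.util.zip.ZipOutputStream;

// Hypothetical demo class: shows how entry names and "PK" signatures sit next to each other in raw ZIP bytes.
public class ZipDirectorySketch {

    // Builds a ZIP in memory and returns its raw bytes decoded as ISO-8859-1 (one char per byte).
    static String rawZipAsLatin1(String... entryNames) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ZipOutputStream zip = new ZipOutputStream(bytes)) {
            for (String name : entryNames) {
                zip.putNextEntry(new ZipEntry(name));
                zip.write("<x/>".getBytes(StandardCharsets.UTF_8)); // placeholder content
                zip.closeEntry();
            }
        }
        return new String(bytes.toByteArray(), StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) throws IOException {
        String raw = rawZipAsLatin1("[Content_Types].xml", "word/document.xml");
        // In the central directory, each name is followed directly by the next "PK" header.
        System.out.println(raw.contains("[Content_Types].xmlPK"));
        System.out.println(raw.contains("word/document.xmlPK"));
    }
}
```

So the dumped regions look more like ZIP directory data of the documents being parsed than like extracted text.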

Unfortunately, I do not know which Office documents trigger this, so I
cannot provide a sample.


Some additional information: when I debugged the code on my own Mac with
an xlsx sample, tika.detect called the ZipArchiveInputStream constructor
twice and called java.util.zip.Inflater#end() the same number of times.
However, tika.parseToString called the ZipArchiveInputStream constructor
once and never called java.util.zip.Inflater#end().

Could this be the cause of the off-heap memory leak, given that Inflater
uses native code?
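
For context on the question above: java.util.zip.Inflater keeps its zlib state in native memory, and end() releases that state immediately. If end() is never called, the native memory is only reclaimed once the Inflater object is garbage collected, so many unreleased instances can show up as off-heap growth. The sketch below shows the usual try/finally pattern with the standard library only. The class and method names are invented for illustration; it does not reproduce what Tika or ZipArchiveInputStream do internally.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

// Hypothetical demo class: releases native zlib state deterministically with end().
public class InflaterEndSketch {

    static byte[] compress(byte[] input) {
        Deflater deflater = new Deflater();
        try {
            deflater.setInput(input);
            deflater.finish();
            byte[] buffer = new byte[input.length + 64]; // large enough for small inputs
            int written = deflater.deflate(buffer);
            return Arrays.copyOf(buffer, written);
        } finally {
            deflater.end(); // frees the native zlib state right away
        }
    }

    static String decompress(byte[] compressed, int originalLength) throws DataFormatException {
        Inflater inflater = new Inflater();
        try {
            inflater.setInput(compressed);
            byte[] out = new byte[originalLength];
            int read = inflater.inflate(out);
            return new String(out, 0, read, StandardCharsets.UTF_8);
        } finally {
            inflater.end(); // without this, release waits for garbage collection
        }
    }

    public static void main(String[] args) throws DataFormatException {
        byte[] original = "hello hello hello hello".getBytes(StandardCharsets.UTF_8);
        byte[] packed = compress(original);
        System.out.println(decompress(packed, original.length));
    }
}
```

Whether a missing end() call is the actual cause here would need to be confirmed on the Tika side, for example by checking whether the stream that owns the Inflater is closed after parsing.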

I look forward to your reply. Thank you very much!

Re: tika-offheap-leak

Posted by 朱桂锋 <pp...@gmail.com>.
Thank you for the clarification.

On Fri, Mar 17, 2023 at 22:45, <rb...@rcbowen.com> wrote:

> Hi.
>
> Unfortunately, you've reached the wrong list. This is a general
> community list for the Apache Software Foundation as a whole. Tika has
> its own lists, which you can find at
> https://tika.apache.org/mail-lists.html and that's where the people
> that can help with this hang out.
>
> --Rich

Re: tika-offheap-leak

Posted by rb...@rcbowen.com.
Hi.

Unfortunately, you've reached the wrong list. This is a general
community list for the Apache Software Foundation as a whole. Tika has
its own lists, which you can find at
https://tika.apache.org/mail-lists.html and that's where the people
that can help with this hang out.

--Rich

