Posted to user@tika.apache.org by "Van Tassell, Kristian" <kr...@siemens.com> on 2018/01/11 18:41:43 UTC

RE: Parse file without creating tmp file

Apologies for bumping such an old thread, but is there an official list somewhere of those filetypes that require the temporary file being created?

Thanks!

-----Original Message-----
From: Nick Burch [mailto:apache@gagravarr.org] 
Sent: Tuesday, July 11, 2017 4:23 AM
To: user@tika.apache.org
Subject: Re: Parse file without creating tmp file

On Tue, 11 Jul 2017, aravinth thangasami wrote:
> Recently I have noticed Tika creates a tmp file before parsing the 
> stream.

Only for certain formats, generally where the underlying parsing library requires a file for random access.

> I don't have much experience in Tika but I feel it is an overhead.
> Can we achieve file parsing without writing to tmp file?

For some files, no, not without rewriting other open source libraries.

For most, it isn't needed and Tika won't do it.
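
To illustrate the random-access point, here is a minimal sketch using only the JDK's own zip classes. It is not Tika's code, and the class and method names (SpoolingSketch, namesFromStream, namesViaTempFile, sampleZip) are hypothetical. It contrasts a sequential reader, which works directly on the stream, with a random-access reader (java.util.zip.ZipFile), which needs a real file and so forces the stream to be spooled to a temp file first.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.ArrayList;
import java.util.Enumeration;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

// Hypothetical illustration only; this is not how Tika itself is implemented.
public class SpoolingSketch {

    // Sequential access: entries are read in order straight from the stream,
    // so nothing is written to disk.
    static List<String> namesFromStream(InputStream in) throws IOException {
        List<String> names = new ArrayList<>();
        try (ZipInputStream zin = new ZipInputStream(in)) {
            for (ZipEntry e = zin.getNextEntry(); e != null; e = zin.getNextEntry()) {
                names.add(e.getName());
            }
        }
        return names;
    }

    // Random access: ZipFile only opens a file on disk, so the stream is
    // copied to a temp file first and the temp file is deleted afterwards.
    static List<String> namesViaTempFile(InputStream in) throws IOException {
        Path tmp = Files.createTempFile("spool-", ".tmp");
        try {
            Files.copy(in, tmp, StandardCopyOption.REPLACE_EXISTING);
            List<String> names = new ArrayList<>();
            try (ZipFile zip = new ZipFile(tmp.toFile())) {
                Enumeration<? extends ZipEntry> entries = zip.entries();
                while (entries.hasMoreElements()) {
                    names.add(entries.nextElement().getName());
                }
            }
            return names;
        } finally {
            Files.deleteIfExists(tmp);
        }
    }

    // Builds a small two-entry zip in memory so the sketch needs no input file.
    static byte[] sampleZip() throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (ZipOutputStream zos = new ZipOutputStream(bos)) {
            zos.putNextEntry(new ZipEntry("a.txt"));
            zos.write("hello".getBytes(StandardCharsets.UTF_8));
            zos.closeEntry();
            zos.putNextEntry(new ZipEntry("b.txt"));
            zos.write("world".getBytes(StandardCharsets.UTF_8));
            zos.closeEntry();
        }
        return bos.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        byte[] zip = sampleZip();
        System.out.println(namesFromStream(new ByteArrayInputStream(zip)));
        System.out.println(namesViaTempFile(new ByteArrayInputStream(zip)));
    }
}
```

The second method shows the overhead being asked about: the whole stream is written out before parsing can start. A library that only offers the file-based API leaves the caller no way around that copy.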

Nick

Re: Parse file without creating tmp file

Posted by Nick Burch <ap...@gagravarr.org>.
On Fri, 12 Jan 2018, Luís Filipe Nassif wrote:
> I can list some of them currently needing temp files: jpeg, zip (for
> detection) and derived (docx, xlsx, pptx), ole2 (for detection) and derived
> (doc, xls, ppt), mdb, pst, rar, 7zip, sqlite...

I've had a go at recording this in the Wiki, along with some general 
guidance on Files and Tika:
https://wiki.apache.org/tika/TikaWithoutFiles

Please tweak as needed, and update for future versions as you come across 
changes! :)

Nick

Re: Parse file without creating tmp file

Posted by Luís Filipe Nassif <lf...@gmail.com>.
I can list some of them currently needing temp files: jpeg, zip (for
detection) and derived (docx, xlsx, pptx), ole2 (for detection) and derived
(doc, xls, ppt), mdb, pst, rar, 7zip, sqlite...

But quoting Tim Allison, that can change depending on dependencies. For
example, in the past PDF needed temp files, in recent versions it was
stored in memory, now it is configurable...

Luis


RE: Parse file without creating tmp file

Posted by "Allison, Timothy B." <ta...@mitre.org>.
I'm not aware of such a list.  Part of the challenge is that we don't know when our dependencies might choose to create a temp file.

Sorry!
