Posted to user@tika.apache.org by Brian Young <bw...@gmail.com> on 2016/03/25 22:07:27 UTC

file-related metadata

Hello,

I'm having an issue where I'm getting back two or three metadata properties
that describe a temp file that Tika is apparently creating under the
hood:

File Modified Date (the current date)
File Name (temp file name: apache-tika-3021300783416279997.tmp)
File Size

I assume this is because I only have a stream to give Tika and no longer
have a physical file. However, the users are seeing these (particularly the
modified date) and misinterpreting them.

I'd like to exclude these, which I could of course do with a simple
string-based filter. However, that feels a little hackish... I was hoping
there might be some way to deactivate file metadata when Tika is the one that
created the temp file? I tried to find the spot in Tika where these are
being added by grepping all the source, but I seem to have come up empty for
some reason.

Thanks for any pointers,
Brian

Re: file-related metadata

Posted by Brian Young <bw...@gmail.com>.
So, after some debugging I discovered the root cause.  The image metadata
extractor is producing properties for "file modified date", "file name" and
"file size."

Unfortunately, as mentioned in my original post, the file information is at
times misleading since it can reflect a Tika temp file. However, I'm not
sure there is a clean or painless way to exclude these from the
ImageMetadataExtractor without basically replacing that parser with a
subclass?

So, the easiest fix might be for me to remove these properties in a
post-processing step after the metadata is extracted. Not as clean as I
would like, but it should work.
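
The post-extraction cleanup described above can be sketched roughly as
follows. This is only an illustration under stated assumptions: the
extracted metadata is modeled as a plain Map so the example runs without
Tika on the classpath, the class and method names are hypothetical, and the
three keys are taken verbatim from the list in the original post. With a
real org.apache.tika.metadata.Metadata object, the equivalent step would be
calling metadata.remove(name) for each of the three names.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

// Hypothetical helper, not part of Tika.
public class TempFileMetadataFilter {

    // Property names as reported in this thread; adjust if your Tika
    // version emits different names.
    static final Set<String> FILE_KEYS = new HashSet<>(Arrays.asList(
            "File Modified Date", "File Name", "File Size"));

    // Returns a copy of the extracted metadata without the file-level keys.
    // The input map is left unmodified.
    static Map<String, String> stripFileKeys(Map<String, String> extracted) {
        Map<String, String> cleaned = new LinkedHashMap<>(extracted);
        cleaned.keySet().removeAll(FILE_KEYS);
        return cleaned;
    }

    public static void main(String[] args) {
        // Stand-in for what the image metadata extractor returned.
        Map<String, String> extracted = new LinkedHashMap<>();
        extracted.put("File Name", "apache-tika-3021300783416279997.tmp");
        extracted.put("File Size", "12345 bytes");
        extracted.put("Image Width", "640 pixels");

        System.out.println(stripFileKeys(extracted));
    }
}
```

Matching on exact names rather than on a "File" prefix keeps unrelated
properties from being dropped by accident. It may also be worth applying the
filter only when the input was a stream, since for a real file these values
would be accurate.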

Thanks,
Brian
