Posted to user@tika.apache.org by Mr Havecamp <mr...@gmail.com> on 2016/10/13 15:29:46 UTC

Get file metadata without retrieving entire file with Tika Server

A while back we contributed a workaround we had for extracting 
metadata/content from remote URLs. It wasn't the ideal way to handle 
extraction of remote files, but it meant we could index full text from 
files stored on a completely different server from our JAXRS server.

We're now revisiting this functionality but the size of the files we 
store has increased; in some cases we are storing uncompressed video 
files. Currently, we have two options to extract metadata from these files:

1) Start the JAXRS server with the enableFileUrl option in the new 
1.14 version and pass URLs to Tika Server.

2) Use some kind of wrapper which downloads the file and then sends it 
on to Tika Server for extraction.
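
For readers comparing the two options, here is a minimal Python sketch of 
what each request might look like. It is not taken from the original 
post. The /meta endpoint on port 9998, the JSON Accept header and the 
fileUrl request header name are assumptions about Tika Server's defaults 
and should be checked against the documentation for the version in use; 
the function names are hypothetical.

```python
import json
import urllib.request

# Assumption: Tika Server is listening on its default port on this host.
TIKA_META_URL = "http://localhost:9998/meta"


def meta_request_by_url(file_url, tika_url=TIKA_META_URL):
    """Option 1: ask Tika Server to fetch the file itself.

    Assumes the server was started with the enableFileUrl option and
    reads the location from a "fileUrl" request header.
    """
    return urllib.request.Request(
        tika_url,
        method="PUT",
        headers={"Accept": "application/json", "fileUrl": file_url},
    )


def meta_request_by_upload(body, tika_url=TIKA_META_URL):
    """Option 2: the wrapper already holds the file bytes and uploads them."""
    return urllib.request.Request(
        tika_url,
        data=body,
        method="PUT",
        headers={"Accept": "application/json"},
    )


def fetch_metadata(request):
    """Send a prepared request and decode the JSON metadata in the reply."""
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

Calling fetch_metadata(meta_request_by_url(...)) or 
fetch_metadata(meta_request_by_upload(...)) against a running server 
would return the metadata as a dictionary. In both cases the whole file 
still crosses the network once: either Tika Server downloads it or the 
wrapper does.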

However, the problem with either option is that we need to retrieve the 
entire file from storage; this is fine for smaller text files but when 
handling these larger files, it seems wasteful and time-consuming to 
download, say, a video file just to extract the metadata information (we 
wouldn't be indexing the video content).

This is probably more of a question for the dev mailing list, but I 
thought I would start my research here to see if anyone has a) 
encountered a similar situation and possibly b) found a potential 
solution.

Thanks


Hayden


Re: Get file metadata without retrieving entire file with Tika Server

Posted by Mr Havecamp <mr...@gmail.com>.
Thanks for the confirmation.

Is this because obtaining information about, for example, the file size 
requires the entire file? I was under the impression that file metadata 
was contained in the first x bytes of the file, but this is probably 
something I have misunderstood.

Thanks


Hayden


On 13/10/16 23:43, Nick Burch wrote:
> On Thu, 13 Oct 2016, Mr Havecamp wrote:
>> However, the problem with either option is that we need to retrieve 
>> the entire file from storage; this is fine for smaller text files but 
>> when handling these larger files, it seems wasteful and 
>> time-consuming to download, say, a video file just to extract the 
>> metadata information (we wouldn't be indexing the video content).
>
> For a great many file formats, including most video ones, you need the 
> whole file to be able to fully extract all the metadata
>
> Nick


Re: Get file metadata without retrieving entire file with Tika Server

Posted by Nick Burch <ap...@gagravarr.org>.
On Thu, 13 Oct 2016, Mr Havecamp wrote:
> However, the problem with either option is that we need to retrieve the 
> entire file from storage; this is fine for smaller text files but when 
> handling these larger files, it seems wasteful and time-consuming to 
> download, say, a video file just to extract the metadata information (we 
> wouldn't be indexing the video content).

For a great many file formats, including most video ones, you need the 
whole file to be able to fully extract all the metadata.

Nick
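
The "first x bytes" idea raised in this thread can be tried out with an 
HTTP Range request. The Python sketch below is an illustration only, not 
something proposed by the posters. It assumes the storage server supports 
byte ranges, and the function names are hypothetical. As the reply above 
points out, many formats need the whole file, so metadata extracted from 
a leading slice may be incomplete or missing entirely.

```python
import urllib.request


def leading_bytes_request(file_url, byte_count):
    """Build a GET request for only the first byte_count bytes of a file."""
    if byte_count <= 0:
        raise ValueError("byte_count must be positive")
    # Byte ranges are inclusive, so the first N bytes are 0..N-1.
    return urllib.request.Request(
        file_url, headers={"Range": f"bytes=0-{byte_count - 1}"}
    )


def read_leading_bytes(file_url, byte_count):
    """Return (data, honoured) for the leading slice of a remote file.

    honoured is True when the server answered 206 Partial Content, and
    False when it ignored the Range header and started sending the whole
    file. Either way, at most byte_count bytes are read.
    """
    request = leading_bytes_request(file_url, byte_count)
    with urllib.request.urlopen(request) as response:
        honoured = response.status == 206
        return response.read(byte_count), honoured
```

The returned bytes could then be sent to Tika Server in place of the full 
file, but the result should be treated as best effort and compared against 
a full extraction for each format before relying on it.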