Posted to user@tika.apache.org by Mr Havecamp <mr...@gmail.com> on 2016/10/13 15:29:46 UTC
Get file metadata without retrieving entire file with Tika Server
A while back we contributed a workaround we had for extracting
metadata/content from remote URLs. It wasn't an ideal way to handle
extraction of remote files, but it meant we could index full text from
files stored on a completely different server from our JAXRS server.
We're now revisiting this functionality but the size of the files we
store has increased; in some cases we are storing uncompressed video
files. Currently, we have two options for extracting metadata from these files:
1) Start the JAXRS server with the enableFileUrl option in the new
1.14 version and pass URLs to Tika Server.
2) Use some kind of wrapper which downloads the file and then sends it
on to Tika Server for extraction.
However, the problem with either option is that we need to retrieve the
entire file from storage; this is fine for smaller text files but when
handling these larger files, it seems wasteful and time-consuming to
download, say, a video file just to extract the metadata information (we
wouldn't be indexing the video content).
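For illustration, the two options above might look roughly like this
from a Python client. This is only a sketch under assumptions: that Tika
Server listens on localhost:9998 and exposes a /meta resource accepting
PUT, and that with enableFileUrl switched on the server reads the remote
location from a fileUrl request header. The function names are made up
for this example.

```python
import json
import urllib.request

# Assumption: Tika Server running locally on its usual port, /meta resource.
TIKA_META = "http://localhost:9998/meta"


def option1_request(file_url, tika_meta=TIKA_META):
    """Option 1: Tika Server fetches the remote file itself.

    Assumes the server was started with the enableFileUrl option and
    takes the location from a `fileUrl` request header.
    """
    return urllib.request.Request(
        tika_meta,
        method="PUT",
        headers={"fileUrl": file_url, "Accept": "application/json"},
    )


def option2_request(file_bytes, tika_meta=TIKA_META):
    """Option 2: a wrapper has already downloaded the file and forwards it."""
    return urllib.request.Request(
        tika_meta,
        data=file_bytes,
        method="PUT",
        headers={"Accept": "application/json"},
    )


def fetch_metadata(request, opener=urllib.request.urlopen):
    """Send either request and decode the JSON metadata in the response."""
    with opener(request) as response:
        return json.load(response)
```

Either way the complete file still crosses the network once: from
storage to Tika Server in option 1, or from storage to the wrapper and
then on to Tika Server in option 2.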
This is probably more of a question for the dev mailing list but I
thought I would start my research here to see if anyone a) has
encountered a similar situation and b) has possibly found a solution.
Thanks
Hayden
Re: Get file metadata without retrieving entire file with Tika Server
Posted by Mr Havecamp <mr...@gmail.com>.
Thanks for the confirmation.
Is this because obtaining information about, for example, the file size
requires the entire file? I was under the impression that file metadata
was contained in the first x bytes of the file, but this is probably
something I have misunderstood.
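To make the "first x bytes" idea concrete, here is a rough sketch of
fetching only the start of a remote file with an HTTP Range header. It
assumes the storage server honours Range requests, and the function
names are hypothetical. Whether the prefix actually holds the metadata
depends on the file format; as the reply quoted below says, many
formats need the whole file, so a parser may return incomplete results
or fail on a truncated stream.

```python
import urllib.request


def first_bytes_request(file_url, n_bytes):
    """Build a GET that asks the storage server for only the first n_bytes."""
    if n_bytes <= 0:
        raise ValueError("n_bytes must be positive")
    # Byte ranges are inclusive, so the first n bytes are 0..n-1.
    return urllib.request.Request(
        file_url, headers={"Range": f"bytes=0-{n_bytes - 1}"}
    )


def fetch_prefix(file_url, n_bytes, opener=urllib.request.urlopen):
    """Download at most n_bytes from the start of the remote file."""
    with opener(first_bytes_request(file_url, n_bytes)) as response:
        # A server that ignores Range replies with the full body,
        # so cap the read instead of trusting the response length.
        return response.read(n_bytes)
```

The returned bytes could then be sent to Tika Server in place of the
full file, with the caveat above about truncated input.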
Thanks
Hayden
On 13/10/16 23:43, Nick Burch wrote:
> On Thu, 13 Oct 2016, Mr Havecamp wrote:
>> However, the problem with either option is that we need to retrieve
>> the entire file from storage; this is fine for smaller text files but
>> when handling these larger files, it seems wasteful and
>> time-consuming to download, say, a video file just to extract the
>> metadata information (we wouldn't be indexing the video content).
>
> For a great many file formats, including most video ones, you need the
> whole file to be able to fully extract all the metadata
>
> Nick
Re: Get file metadata without retrieving entire file with Tika Server
Posted by Nick Burch <ap...@gagravarr.org>.
On Thu, 13 Oct 2016, Mr Havecamp wrote:
> However, the problem with either option is that we need to retrieve the
> entire file from storage; this is fine for smaller text files but when
> handling these larger files, it seems wasteful and time-consuming to
> download, say, a video file just to extract the metadata information (we
> wouldn't be indexing the video content).
For a great many file formats, including most video ones, you need the
whole file to be able to fully extract all the metadata.
Nick