Posted to user@tika.apache.org by Kudrettin Güleryüz <ku...@gmail.com> on 2018/01/11 14:03:42 UTC
Binary file check
Hi,
Does the Tika library provide an efficient binary file check?
For a source code search engine I'd like to provide support for indexing
all text files (including non-source-code text files) and exclude files
that are binary.
Files are read over NFS, so a check that reads as little of each file as
possible would be preferable, if that is possible at all.
Thank you,
Kudret
Re: Binary file check
Posted by Julian Reschke <ju...@gmx.de>.
On 2018-01-19 16:30, Kudrettin Güleryüz wrote:
> One more thing, regarding application/xml vs text/xml
> I think I'll skip application/xml for now and just include text/xml
>
> Assuming application/xml is compressed XML such as Open office documents
> and text/xml as uncompressed XML
application/xml is not compressed.
Best regards, Julian
Re: Binary file check
Posted by Nick Burch <ap...@gagravarr.org>.
On Fri, 19 Jan 2018, Kudrettin Güleryüz wrote:
> One more thing, regarding application/xml vs text/xml
> I think I'll skip application/xml for now and just include text/xml
>
> Assuming application/xml is compressed XML such as Open office documents
> and text/xml as uncompressed XML
Nope! They're both uncompressed textual XML!
Generally though, when defining a new xml-based filetype, the spec authors
decide if it's going to be vaguely readable-editable or opaque, then pick
if they go for text/xml or application/xml as the parent type. Can be a
bit random which they go for though! See
https://stackoverflow.com/a/4832418/685641 for a bit more info and some
references
Nick
Re: Binary file check
Posted by Kudrettin Güleryüz <ku...@gmail.com>.
One more thing, regarding application/xml vs text/xml
I think I'll skip application/xml for now and just include text/xml
Assuming application/xml is compressed XML, such as OpenOffice documents,
and text/xml is uncompressed XML
Re: Binary file check
Posted by Kudrettin Güleryüz <ku...@gmail.com>.
This is a source code search engine, and some of the users here are like some
of the humans over there :) So yes, XML files and source code files are human
readable by my definition. But I think I get your point: rather than
detecting binary or not, decide which MIME types to allow, and use Tika to
get the MIME type of each file at runtime while traversing the file system
for indexing.
Thank you
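A minimal sketch of the allow-list idea described above, assuming the MIME
type has already been detected and is available as a string. The class name
MimeAllowList and the contents of EXTRA_ALLOWED are hypothetical, chosen only
for illustration; which types to allow is a policy decision for the indexer.

```java
import java.util.Locale;
import java.util.Set;

// Hypothetical helper: decides whether a detected MIME type should be indexed.
final class MimeAllowList {

    // Illustrative choice only: non-text/ types that are still allowed.
    private static final Set<String> EXTRA_ALLOWED = Set.of("application/xml");

    static boolean isIndexable(String mimeType) {
        if (mimeType == null) {
            return false;
        }
        // Drop parameters such as "; charset=UTF-8" and normalise case.
        int semicolon = mimeType.indexOf(';');
        String base = (semicolon >= 0 ? mimeType.substring(0, semicolon) : mimeType)
                .trim()
                .toLowerCase(Locale.ROOT);
        // Everything under text/ is allowed, plus the explicit extras.
        return base.startsWith("text/") || EXTRA_ALLOWED.contains(base);
    }
}
```

Keeping the extras in one set makes it easy to add or drop a type such as
application/xml after trying a few real files.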
Re: Binary file check
Posted by Nick Burch <ap...@gagravarr.org>.
On Thu, 11 Jan 2018, Kudrettin Güleryüz wrote:
> I am not an expert on mime types and how they extend. My definition of
> binary is any file that is not in human readable form. Any other file,
> I'd like to index. Would that answer your question?
Some of us humans here can read a wider range of formats than others,
especially if we go slowly... ;)
For now, I'd suggest you start with:
* Does the mimetype start with text/ ?
* If not, check all parents (supertypes) to see if any of those start
with text/
Then:
* Try a few formats with a parent of application/xml, and see if you want
to include or exclude those (are they human readable enough?)
* Try a few formats with a parent of text/xml or text/html, and see if
you want to include or exclude them (ditto on really human readable)
Use https://tika.apache.org/1.17/api/org/apache/tika/mime/MediaTypeRegistry.html#getSupertype-org.apache.tika.mime.MediaType-
to get the parent types
Use http://tika.apache.org/1.17/api/org/apache/tika/mime/MediaType.html#getType--
to check if a mimetype is text/ or not (check for getType().equals("text"))
Nick
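The first two steps above (check the type, then walk its supertypes) could be
sketched roughly as follows. To keep the example runnable without Tika on the
classpath, it uses a plain Map as a hypothetical stand-in for the registry:
each entry maps a type to its parent. With Tika, the parent lookup would go
through MediaTypeRegistry.getSupertype and the text check through
MediaType.getType, as linked above. The class name TextTypeCheck and the
depth limit are assumptions made for this sketch.

```java
import java.util.Map;

// Hypothetical helper illustrating the supertype walk with a plain parent table.
final class TextTypeCheck {

    // Returns true if the type itself, or any of its ancestors, is a text/ type.
    static boolean isTextual(String mimeType, Map<String, String> parents) {
        String current = mimeType;
        // Bound the walk so a malformed (cyclic) parent table cannot loop forever.
        for (int depth = 0; current != null && depth < 32; depth++) {
            if (current.startsWith("text/")) {
                return true;
            }
            // Move up to the parent type; null means there is no known parent.
            current = parents.get(current);
        }
        return false;
    }
}
```

The parent table is passed in rather than hard-coded, so the same walk works
whatever hierarchy the real registry reports.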
Re: Binary file check
Posted by Kudrettin Güleryüz <ku...@gmail.com>.
I am not an expert on mime types and how they extend. My definition of
binary is any file that is not in human readable form. Any other file, I'd
like to index. Would that answer your question?
Re: Binary file check
Posted by Nick Burch <ap...@gagravarr.org>.
On Thu, 11 Jan 2018, Kudrettin Güleryüz wrote:
> Does Tika library provide an efficient binary file check?
How do you define "binary"?
Only things with a mimetype that starts text/ ? Or do you want to include
application/xml files? Or things that extend from XML like DIF and
FictionBook? Only things that contain ASCII-printable characters? Other?
We need to know your definition of binary to be able to suggest!
Nick
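To make one of those definitions concrete, here is a rough sketch of the
"only ASCII-printable characters" option, reading just a bounded prefix of
each file to limit traffic over NFS. This is plain Java, not a Tika feature;
the class name PrefixSniffer and the 8192-byte cap are assumptions. Under this
definition any file with non-ASCII bytes in the prefix (UTF-8 text with
accented characters, UTF-16 text) is treated as binary, which may be too
strict for some collections.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper: classifies a file by inspecting only its first bytes.
final class PrefixSniffer {

    // Assumed cap on the number of bytes read per file.
    static final int MAX_PREFIX_BYTES = 8192;

    // True if every byte in the prefix is printable ASCII, tab, LF or CR.
    // An empty prefix counts as text.
    static boolean looksAsciiText(byte[] prefix, int length) {
        for (int i = 0; i < length; i++) {
            int b = prefix[i] & 0xFF; // treat the byte as unsigned
            boolean printable = b >= 0x20 && b <= 0x7E;
            boolean whitespace = b == '\t' || b == '\n' || b == '\r';
            if (!printable && !whitespace) {
                return false;
            }
        }
        return true;
    }

    // Reads at most MAX_PREFIX_BYTES from the file and applies the check above.
    static boolean looksAsciiText(Path file) throws IOException {
        byte[] buffer = new byte[MAX_PREFIX_BYTES];
        try (InputStream in = Files.newInputStream(file)) {
            int length = in.readNBytes(buffer, 0, buffer.length);
            return looksAsciiText(buffer, length);
        }
    }
}
```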