Posted to user@tika.apache.org by Kudrettin Güleryüz <ku...@gmail.com> on 2018/01/11 14:03:42 UTC

Binary file check

Hi,

Does the Tika library provide an efficient binary file check?

For a source code search engine, I'd like to support indexing all text
files (including non-source-code text files) and exclude files that are
binary.
Files are read over NFS, so a binary check that reads as little of the file
content as possible would be preferable, if that is possible at all.

Thank you,
Kudret

Re: Binary file check

Posted by Julian Reschke <ju...@gmx.de>.
On 2018-01-19 16:30, Kudrettin Güleryüz wrote:
> One more thing, regarding application/xml vs text/xml:
> I think I'll skip application/xml for now and just include text/xml.
>
> I'm assuming application/xml is compressed XML, such as OpenOffice
> documents, and text/xml is uncompressed XML.

application/xml is not compressed.

Best regards, Julian

Re: Binary file check

Posted by Nick Burch <ap...@gagravarr.org>.
On Fri, 19 Jan 2018, Kudrettin Güleryüz wrote:
> One more thing, regarding application/xml vs text/xml:
> I think I'll skip application/xml for now and just include text/xml.
>
> I'm assuming application/xml is compressed XML, such as OpenOffice
> documents, and text/xml is uncompressed XML.

Nope! They're both uncompressed textual XML!

Generally though, when defining a new XML-based file type, the spec authors
decide whether it's going to be vaguely readable and editable or opaque,
then pick text/xml or application/xml as the parent type accordingly. It
can be a bit random which they go for, though! See
https://stackoverflow.com/a/4832418/685641 for a bit more info and some
references.

Nick

Re: Binary file check

Posted by Kudrettin Güleryüz <ku...@gmail.com>.
One more thing, regarding application/xml vs text/xml:
I think I'll skip application/xml for now and just include text/xml.

I'm assuming application/xml is compressed XML, such as OpenOffice
documents, and text/xml is uncompressed XML.


On Fri, Jan 19, 2018 at 10:23 AM Kudrettin Güleryüz <ku...@gmail.com>
wrote:

> This is a source code search engine, and some of the users here are like
> some humans over there :) So yes, XML files and source code files are
> human readable by my definition. But I think I get your point: rather than
> detecting binary or not, decide which mime types to allow, and use Tika to
> get the mime type of each file at runtime while traversing the file system
> for indexing.
>
> Thank you
>
>
>
> On Mon, Jan 15, 2018 at 2:25 AM Nick Burch <ap...@gagravarr.org> wrote:
>
>> On Thu, 11 Jan 2018, Kudrettin Güleryüz wrote:
>> > I am not an expert on mime types and how they extend.  My definition of
>> > binary is any file that is not in human readable form. Any other file,
>> > I'd like to index. Would that answer your question?
>>
>> Some of us humans here can read a wider range of formats than others,
>> especially if we go slowly... ;)
>>
>> For now, I'd suggest you start with:
>>   * Does the mimetype start with text/ ?
>>   * If not, check all parents (supertypes) to see if any of those start
>>     with text/
>>
>> Then:
>>   * Try a few formats with a parent of application/xml, and see if you
>>     want to include or exclude those (are they human readable enough?)
>>   * Try a few formats with a parent of text/xml or text/html, and see if
>>     you want to include or exclude them (ditto on really human readable)
>>
>> Use
>> https://tika.apache.org/1.17/api/org/apache/tika/mime/MediaTypeRegistry.html#getSupertype-org.apache.tika.mime.MediaType-
>> to get the parent types
>>
>> Use
>> http://tika.apache.org/1.17/api/org/apache/tika/mime/MediaType.html#getType--
>> to check if a mimetype is text/ or not (check for
>> getType().equals("text"))
>>
>> Nick
>
>

Re: Binary file check

Posted by Kudrettin Güleryüz <ku...@gmail.com>.
This is a source code search engine, and some of the users here are like
some humans over there :) So yes, XML files and source code files are human
readable by my definition. But I think I get your point: rather than
detecting binary or not, decide which mime types to allow, and use Tika to
get the mime type of each file at runtime while traversing the file system
for indexing.

Thank you
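
The allow-list idea described above could be sketched roughly as follows. This is only an illustration under stated assumptions: the class name AllowListFilter, the allowed types, and the injected detector function are all hypothetical. In a real indexer the detector would wrap a call into Tika rather than the stand-in used here.

```java
import java.util.Set;
import java.util.function.Function;

// Hypothetical helper: decides whether a file should be indexed based on
// an allow-list of mime types. Nothing here is part of Tika itself.
final class AllowListFilter {
    // Mime types the indexer accepts (an assumption, chosen per deployment).
    private final Set<String> allowed;
    // Maps a file path to a mime type string. Injected so that the real
    // detection call (for example one backed by Tika) can be swapped in.
    private final Function<String, String> detector;

    AllowListFilter(Set<String> allowed, Function<String, String> detector) {
        this.allowed = allowed;
        this.detector = detector;
    }

    // True only when the detected type is on the allow-list; an unknown
    // (null) type is treated as "do not index".
    boolean shouldIndex(String path) {
        String mime = detector.apply(path);
        return mime != null && allowed.contains(mime);
    }
}
```

Keeping the detector as a parameter means the file system traversal only has to call shouldIndex for each path, and the allow-list can grow (for example with application/xml) without touching the traversal code.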



On Mon, Jan 15, 2018 at 2:25 AM Nick Burch <ap...@gagravarr.org> wrote:

> On Thu, 11 Jan 2018, Kudrettin Güleryüz wrote:
> > I am not an expert on mime types and how they extend.  My definition of
> > binary is any file that is not in human readable form. Any other file,
> > I'd like to index. Would that answer your question?
>
> Some of us humans here can read a wider range of formats than others,
> especially if we go slowly... ;)
>
> For now, I'd suggest you start with:
>   * Does the mimetype start with text/ ?
>   * If not, check all parents (supertypes) to see if any of those start
>     with text/
>
> Then:
>   * Try a few formats with a parent of application/xml, and see if you want
>     to include or exclude those (are they human readable enough?)
>   * Try a few formats with a parent of text/xml or text/html, and see if
>     you want to include or exclude them (ditto on really human readable)
>
> Use
> https://tika.apache.org/1.17/api/org/apache/tika/mime/MediaTypeRegistry.html#getSupertype-org.apache.tika.mime.MediaType-
> to get the parent types
>
> Use
> http://tika.apache.org/1.17/api/org/apache/tika/mime/MediaType.html#getType--
> to check if a mimetype is text/ or not (check for getType().equals("text"))
>
> Nick

Re: Binary file check

Posted by Nick Burch <ap...@gagravarr.org>.
On Thu, 11 Jan 2018, Kudrettin Güleryüz wrote:
> I am not an expert on mime types and how they extend.  My definition of 
> binary is any file that is not in human readable form. Any other file, 
> I'd like to index. Would that answer your question?

Some of us humans here can read a wider range of formats than others,
especially if we go slowly... ;)

For now, I'd suggest you start with:
  * Does the mimetype start with text/ ?
  * If not, check all parents (supertypes) to see if any of those start
    with text/

Then:
  * Try a few formats with a parent of application/xml, and see if you want
    to include or exclude those (are they human readable enough?)
  * Try a few formats with a parent of text/xml or text/html, and see if
    you want to include or exclude them (ditto on really human readable)

Use https://tika.apache.org/1.17/api/org/apache/tika/mime/MediaTypeRegistry.html#getSupertype-org.apache.tika.mime.MediaType-
to get the parent types

Use http://tika.apache.org/1.17/api/org/apache/tika/mime/MediaType.html#getType--
to check if a mimetype is text/ or not (check for getType().equals("text"))

Nick
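
The first two steps suggested above (check for text/, then walk the supertypes) could look roughly like the sketch below. It deliberately avoids a Tika dependency: the PARENTS map is a hand-written, hypothetical stand-in for the registry, and its entries are made-up types, not Tika's real hierarchy. With Tika, the map lookup would presumably be replaced by MediaTypeRegistry.getSupertype(...) and the prefix test by getType().equals("text"), as the message describes.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Sketch of the "is it text/, or does any supertype start with text/" walk.
final class TextTypeCheck {
    // child type -> parent type. Illustrative, invented entries only;
    // these are NOT copied from Tika's registry.
    private static final Map<String, String> PARENTS = new HashMap<>();
    static {
        PARENTS.put("application/x-example-script", "text/plain");
        PARENTS.put("application/x-example-config", "application/x-example-script");
        PARENTS.put("application/x-example-archive", "application/zip");
    }

    // Returns true if the type itself, or any ancestor reachable through
    // PARENTS, has the top-level type "text".
    static boolean isTextLike(String mimeType) {
        Set<String> seen = new HashSet<>(); // guards against cycles in the map
        String current = mimeType;
        while (current != null && seen.add(current)) {
            if (current.startsWith("text/")) {
                return true;
            }
            current = PARENTS.get(current); // null once there is no parent
        }
        return false;
    }
}
```

Types that end up under application/xml rather than text/ would return false here, which matches the advice to try a few of those by hand and decide separately whether to include them.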

Re: Binary file check

Posted by Kudrettin Güleryüz <ku...@gmail.com>.
I am not an expert on mime types and how they extend.  My definition of
binary is any file that is not in human readable form. Any other file, I'd
like to index. Would that answer your question?


On Thu, Jan 11, 2018 at 10:01 AM Nick Burch <ap...@gagravarr.org> wrote:

> On Thu, 11 Jan 2018, Kudrettin Güleryüz wrote:
> > Does the Tika library provide an efficient binary file check?
>
> How do you define "binary"?
>
> Only things with a mimetype that starts text/ ? Or do you want to include
> application/xml files? Or things that extend from XML like DIF and
> FictionBook? Only things that contain ascii-printable characters? Other?
>
> We need to know your definition of binary to be able to suggest!
>
> Nick

Re: Binary file check

Posted by Nick Burch <ap...@gagravarr.org>.
On Thu, 11 Jan 2018, Kudrettin Güleryüz wrote:
> Does the Tika library provide an efficient binary file check?

How do you define "binary"?

Only things with a mimetype that starts text/ ? Or do you want to include 
application/xml files? Or things that extend from XML like DIF and
FictionBook? Only things that contain ascii-printable characters? Other?

We need to know your definition of binary to be able to suggest!

Nick
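
For the "only things that contain ascii-printable characters" definition, combined with the original wish to read as little as possible over NFS, a plain-Java sketch might look like this. It is not a Tika API: the class name PrefixSniffer and the 4096-byte prefix size are assumptions made for illustration, and the strict ASCII test would also reject legitimate UTF-8 text containing non-ASCII characters.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;

// Heuristic sketch: treat a file as text if its first bytes are all
// printable ASCII or common whitespace. Only a prefix is ever read.
final class PrefixSniffer {
    // Arbitrary prefix size (an assumption); tune for the NFS setup.
    static final int PREFIX_BYTES = 4096;

    // Checks the first len bytes of buf.
    static boolean looksLikeAsciiText(byte[] buf, int len) {
        for (int i = 0; i < len; i++) {
            int b = buf[i] & 0xFF; // treat the byte as unsigned
            boolean whitespace = b == '\t' || b == '\n' || b == '\r';
            boolean printable = b >= 0x20 && b <= 0x7E;
            if (!whitespace && !printable) {
                return false;
            }
        }
        return true; // note: an empty prefix counts as text
    }

    // Reads at most PREFIX_BYTES from the file, then applies the check.
    // InputStream.readNBytes(byte[], int, int) requires Java 9 or later.
    static boolean looksLikeAsciiText(Path file) throws IOException {
        byte[] buf = new byte[PREFIX_BYTES];
        try (InputStream in = Files.newInputStream(file)) {
            int len = in.readNBytes(buf, 0, buf.length);
            return looksLikeAsciiText(buf, len);
        }
    }
}
```

A check like this can misclassify files whose binary content starts after the prefix, so it is better seen as a cheap pre-filter than as a replacement for mime type detection.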