Posted to user@tika.apache.org by Avi Hayun <av...@gmail.com> on 2014/08/08 08:28:32 UTC

How to identify binary content ?

Hi,

I am crawling my site and am using Tika for binary content parsing.

But how can I tell whether a given URL contains binary content or plain text?

I can get the contentType.


So for now I am using:
if (typeStr.contains("image") || typeStr.contains("audio") ||
        typeStr.contains("video") || typeStr.contains("application")) {
    return true;
}


Which is dumb code.

I will replace the plain strings with Tika's MediaType objects, but I still
need better code.

Does anyone have a better idea?




Thank you for your help,
Avi

Re: How to identify binary content ?

Posted by Avi Hayun <av...@gmail.com>.
Thank you Jukka very much.

I will use your code.


I will also rethink my whole logic about parsing text.



Avi.


On Thu, Aug 14, 2014 at 4:04 PM, Jukka Zitting <ju...@zitting.name> wrote:

> Hi,
>
> On Sun, Aug 10, 2014 at 1:46 AM, Avi Hayun <av...@gmail.com> wrote:
> > How do I identify content types which can't be read as text (in notepad
> > for example) because they have some binary content in them.
>
> You can use the media type relationship information stored in
> Tika's type registry, like this:
>
>     Tika tika = new Tika();
>     MediaType type = MediaType.parse(tika.detect(...));
>
>     MediaTypeRegistry registry = MediaTypeRegistry.getDefaultRegistry();
>     // isInstanceOf(a, b): true when a equals b or is a specialization of b
>     if (registry.isInstanceOf(type, MediaType.TEXT_PLAIN)) {
>         // process text
>     } else {
>         // process binary
>     }
>
>
> > [...] if it finds text-parsable content, I want it to take the content
> > as it is
>
> Note that consuming text data can be surprisingly difficult given all
> the different character encodings out there. Tika's parser classes
> contain quite a bit of logic for automatically figuring out the
> correct character encoding and other details needed for correctly
> consuming text data.
>
> What's your reason for wanting to process text data separately? Is
> there some missing feature in Tika that would help achieve your use
> case without the need for custom processing of text data?
>
> For example the HTML parser supports the IdentityHtmlMapper feature
> for skipping the HTML simplification that Tika does by default. To
> activate that feature, you can pass an IdentityHtmlMapper instance in
> the parse context:
>
>     ParseContext context = new ParseContext();
>     context.set(HtmlMapper.class, new IdentityHtmlMapper());
>
> --
> Jukka Zitting
>

Re: How to identify binary content ?

Posted by Jukka Zitting <ju...@zitting.name>.
Hi,

On Sun, Aug 10, 2014 at 1:46 AM, Avi Hayun <av...@gmail.com> wrote:
> How do I identify content types which can't be read as text (in notepad for
> example) because they have some binary content in them.

You can use the media type relationship information stored in
Tika's type registry, like this:

    Tika tika = new Tika();
    MediaType type = MediaType.parse(tika.detect(...));

    MediaTypeRegistry registry = MediaTypeRegistry.getDefaultRegistry();
    // isInstanceOf(a, b): true when a equals b or is a specialization of b
    if (registry.isInstanceOf(type, MediaType.TEXT_PLAIN)) {
        // process text
    } else {
        // process binary
    }


> [...] if it finds text-parsable content, I want it to take the content as it is

Note that consuming text data can be surprisingly difficult given all
the different character encodings out there. Tika's parser classes
contain quite a bit of logic for automatically figuring out the
correct character encoding and other details needed for correctly
consuming text data.
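
To illustrate the encoding point, here is a minimal sketch that uses only the
JDK, not Tika. It shows how the same word decodes cleanly under one charset
and fails under another. The class and method names (StrictUtf8, decodeOrNull)
are made up for this illustration, and strict UTF-8 validation is only one
possible check; it is not how Tika detects encodings.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class StrictUtf8 {

    /** Decodes the bytes as UTF-8, or returns null if they are not valid UTF-8. */
    public static String decodeOrNull(byte[] bytes) {
        // REPORT makes the decoder throw instead of silently substituting U+FFFD.
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            return decoder.decode(ByteBuffer.wrap(bytes)).toString();
        } catch (CharacterCodingException e) {
            return null;
        }
    }

    public static void main(String[] args) {
        String word = "caf\u00e9"; // "cafe" with an acute accent on the e
        byte[] utf8 = word.getBytes(StandardCharsets.UTF_8);
        byte[] latin1 = word.getBytes(StandardCharsets.ISO_8859_1);

        // Valid UTF-8 decodes back to the original string.
        System.out.println(word.equals(decodeOrNull(utf8)));
        // The ISO-8859-1 bytes are not valid UTF-8, so strict decoding rejects them.
        System.out.println(decodeOrNull(latin1) == null);
    }
}
```

A crawler that takes text "as it is" has to make this kind of decision for
every response, which is part of what the parser classes already handle.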

What's your reason for wanting to process text data separately? Is
there some missing feature in Tika that would help achieve your use
case without the need for custom processing of text data?

For example the HTML parser supports the IdentityHtmlMapper feature
for skipping the HTML simplification that Tika does by default. To
activate that feature, you can pass an IdentityHtmlMapper instance in
the parse context:

    ParseContext context = new ParseContext();
    context.set(HtmlMapper.class, new IdentityHtmlMapper());

--
Jukka Zitting

Re: How to identify binary content ?

Posted by Avi Hayun <av...@gmail.com>.
Not exactly,

My question is:

How do I identify content types which can't be read as text (in Notepad, for
example) because they contain binary content?


For example:
application/atom+xml - is text for me as it doesn't contain any binary
content
application/json - ditto


But
application/pdf - contains binary content
audio/ogg - contains binary content



My web crawler crawls the web. If it finds text-parsable content, I want it
to take the content as it is, but if the content is binary I want to take
Tika's parsed output instead.
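
The distinction drawn in the examples above could be approximated with a small
string-based check, sketched below with the JDK only. This is a rough
heuristic and not a Tika API: the class name TextualTypeCheck, the method
looksTextual, and the allow-list of application/* types are all assumptions
made for this illustration, and the list would need extending for real use.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

public class TextualTypeCheck {

    // Hypothetical allow-list of application/* types that are really text.
    private static final Set<String> TEXTUAL_APPLICATION_TYPES = new HashSet<>(
            Arrays.asList("application/json", "application/xml", "application/javascript"));

    /** Returns true when the Content-Type value looks like a text format. */
    public static boolean looksTextual(String contentType) {
        if (contentType == null) {
            return false;
        }
        // Drop parameters such as "; charset=utf-8" and normalise case.
        String type = contentType.split(";", 2)[0].trim().toLowerCase(Locale.ROOT);
        return type.startsWith("text/")
                || type.endsWith("+xml")
                || type.endsWith("+json")
                || TEXTUAL_APPLICATION_TYPES.contains(type);
    }

    public static void main(String[] args) {
        String[] samples = {
            "application/atom+xml", "application/json", "application/pdf", "audio/ogg"
        };
        for (String sample : samples) {
            System.out.println(sample + " -> " + (looksTextual(sample) ? "text" : "binary"));
        }
    }
}
```

With this check the two application/* examples above are treated as text,
while application/pdf and audio/ogg fall through to the binary branch.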


On Fri, Aug 8, 2014 at 8:46 PM, Ken Krugler <kk...@transpac.com>
wrote:

> Hi Avi,
>
> Just to clarify, are you asking for some way to determine whether a given
> file (format) will never return any text (other than metadata)?
>
> Thanks,
>
> -- Ken
>
> On Aug 7, 2014, at 11:28pm, Avi Hayun <av...@gmail.com> wrote:
>
> Hi,
>
> I am crawling my site and am using Tika for binary content parsing.
>
> But, how can I know if a certain url contains binary content or plain
> text ?
>
> I can get the contentType.
>
>
> So for now I am using:
> if (typeStr.contains("image") || typeStr.contains("audio") ||
>         typeStr.contains("video") || typeStr.contains("application")) {
>     return true;
> }
>
>
> Which is dumb code.
>
> I will replace the plain strings with Tika's MediaType objects but still I
> need better code
>
> Does anyone have any better idea ?
>
>
>
>
> Thank you for your help,
> Avi
>
>
>
>

Re: How to identify binary content ?

Posted by Ken Krugler <kk...@transpac.com>.
Hi Avi,

Just to clarify, are you asking for some way to determine whether a given file (format) will never return any text (other than metadata)?

Thanks,

-- Ken

On Aug 7, 2014, at 11:28pm, Avi Hayun <av...@gmail.com> wrote:

> Hi,
> 
> I am crawling my site and am using Tika for binary content parsing.
> 
> But, how can I know if a certain url contains binary content or plain text ?
> 
> I can get the contentType.
> 
> 
> So for now I am using:
> if (typeStr.contains("image") || typeStr.contains("audio") || typeStr.contains("video") || typeStr.contains("application")) {
>     return true;
> }
> 
> 
> Which is dumb code.
> 
> I will replace the plain strings with Tika's MediaType objects but still I need better code
> 
> Does anyone have any better idea ?
> 
> 
> 
> 
> Thank you for your help,
> Avi