Posted to user@tika.apache.org by Public Network Services <pu...@gmail.com> on 2012/07/20 09:36:01 UTC

AVRO content detection

Does Tika support AVRO format detection?

AVRO files come either in JSON or their own binary format. For the JSON
case Tika would probably return "text/plain", but what about the binary
case?

Re: AVRO content detection

Posted by Public Network Services <pu...@gmail.com>.
Any suggestions?



Re: AVRO content detection

Posted by Public Network Services <pu...@gmail.com>.
AVRO can be text (with some binary characters), or binary.

In general, there is a schema encoded in JSON, which is embedded in the
data. The Apache Avro 1.7.1 specification
<http://avro.apache.org/docs/current/spec.html>, from the Apache Avro
website <http://avro.apache.org/>, explicitly mandates in its "Data
Serialization" section that

*Avro data is always serialized with its schema. Files that store Avro
data should always also include the schema for that data in the same file.*


Immediately afterwards in the same document, the "Encodings" section reads

*Avro specifies two serialization encodings: binary and JSON. Most
applications will use the binary encoding, as it is smaller and faster.
But, for debugging and web-based applications, the JSON encoding may
sometimes be appropriate.*


The file structure is described a bit later, in the "Object Container
Files" section, which, among other things, specifies that

*A file has a schema, and all objects stored in the file must be written
according to that schema, using binary encoding.*

*A file consists of:*

   - *A file header, followed by*
   - *one or more file data blocks.*

*A file header consists of:*

   - *Four bytes, ASCII 'O', 'b', 'j', followed by 1.*
   - *file metadata, including the schema.*
   - *The 16-byte, randomly-generated sync marker for this file.*
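
As a rough illustration of the header layout quoted above, the following
Python sketch checks only the first four bytes. It assumes that the "1" in
the specification means the byte value 1 (0x01, the SOH control character)
rather than the ASCII digit '1'. The names AVRO_MAGIC and has_avro_magic are
made up for this example and are not part of Tika or Avro.

```python
# Assumption: "followed by 1" is the byte value 0x01, not the character '1'.
AVRO_MAGIC = b"Obj\x01"


def has_avro_magic(prefix: bytes) -> bool:
    """Return True if the given bytes start with the assumed Avro magic."""
    return prefix[:4] == AVRO_MAGIC
```

Under that reading, a file starting with "Obj" followed by SOH would pass
this check, while one starting with the four characters "Obj1" would not.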

Searching the web, there do not appear to be many public Avro datasets
available, but one is the Berkeley Enron E-mails
<http://hortonworks.com/blog/the-data-lifecycle-part-one-avroizing-the-enron-emails/>,
which you can download directly as a single file (approximately 450 MB)
from the "here" link in the "Conclusion" section at the end of the page.

If you download that dataset, you will see that it starts with "Obj" but
not "Obj1" and immediately after the first 3 bytes ("Obj") it has 3
"binary" characters (SOH, STX, SYN).

So, detection could be based on some search for 3 magic bytes ('O', 'b',
'j') at the start, plus the presence of some extra characters like the
3 "binary" ones I mentioned above.
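
The idea above (magic bytes plus a look at what follows them) could be
sketched as below. This is an illustration in Python, not Tika code. It
assumes the fourth byte is the value 1 (SOH), and it decodes the file
metadata as an Avro map whose counts and lengths are zigzag variable-length
longs, as described in the Avro specification. All function names are
hypothetical.

```python
import io


def read_long(stream):
    """Decode one Avro long (variable-length, zigzag-encoded)."""
    shift = 0
    accum = 0
    while True:
        raw = stream.read(1)
        if not raw:
            raise EOFError("truncated variable-length long")
        byte = raw[0]
        accum |= (byte & 0x7F) << shift
        if not byte & 0x80:  # high bit clear: last byte of this number
            break
        shift += 7
    return (accum >> 1) ^ -(accum & 1)  # undo zigzag encoding


def read_bytes(stream):
    """Read a length-prefixed byte string (Avro 'bytes' or 'string')."""
    length = read_long(stream)
    if length < 0:
        raise ValueError("negative length")
    data = stream.read(length)
    if len(data) != length:
        raise EOFError("truncated byte string")
    return data


def read_header_metadata(data):
    """Return the header metadata map, or None if the magic is absent."""
    stream = io.BytesIO(data)
    if stream.read(4) != b"Obj\x01":  # assumed magic: 'O', 'b', 'j', 0x01
        return None
    meta = {}
    while True:
        count = read_long(stream)
        if count == 0:  # a zero count ends the map
            break
        if count < 0:  # negative count: a block size in bytes follows
            read_long(stream)
            count = -count
        for _ in range(count):
            key = read_bytes(stream).decode("utf-8")
            meta[key] = read_bytes(stream)
    return meta
```

A detector along these lines might treat the presence of an "avro.schema"
key in the returned map as confirmation. Under these assumptions, a header
whose metadata starts with a one-entry block keyed "avro.schema" has 0x02
(count 1) and 0x16 (key length 11) right after the magic, that is STX and
SYN, which would be consistent with the bytes seen in the Enron file.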


On Fri, Jul 20, 2012 at 9:47 PM, Nick Burch <ni...@alfresco.com> wrote:

> On Fri, 20 Jul 2012, Public Network Services wrote:
>
>> Does Tika support AVRO format detection?
>>
>
> I don't think so - I can't see any references to it in the code
>
>
>> AVRO files come either in JSON or their own binary format. For the JSON
>> case Tika would probably return "text/plain", but what about the binary
>> case?
>>
>
> Are you able to dig out some references (probably on the Avro website)
> about the file formats? Some small sample files would be handy. Also,
> what's the usual mimetype used for these files?
>
> Nick
>

Re: AVRO content detection

Posted by Nick Burch <ni...@alfresco.com>.
On Fri, 20 Jul 2012, Public Network Services wrote:
> Does Tika support AVRO format detection?

I don't think so - I can't see any references to it in the code

> AVRO files come either in JSON or their own binary format. For the JSON
> case Tika would probably return "text/plain", but what about the binary
> case?

Are you able to dig out some references (probably on the Avro website) 
about the file formats? Some small sample files would be handy. Also, 
what's the usual mimetype used for these files?

Nick