Posted to user@avro.apache.org by Sean Busbey <bu...@cloudera.com> on 2014/03/17 21:17:46 UTC

Re:

Hi Shaq!

Could you describe your use case in more detail?

Generally, HDFS behaves poorly when faced with many small files. Could
you perhaps colocate several datums in one file? That would reduce both the
relative overhead of the schema and the pressure on the HDFS NameNode.
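
To make the amortization argument concrete, here is a rough sketch of how
the schema's share of a file shrinks as more datums share one header. The
function name and the byte sizes are hypothetical, chosen only for
illustration, and the model ignores the per-block sync markers and counts
that a real Avro container file also writes.

```python
def schema_overhead(schema_bytes: int, datum_bytes: int, datums_per_file: int) -> float:
    """Fraction of the file taken up by the schema header.

    Simplified model: one schema header per file plus N equally sized
    datums. All sizes are illustrative assumptions, not measurements.
    """
    payload = datum_bytes * datums_per_file
    return schema_bytes / (schema_bytes + payload)


# Hypothetical sizes: a 400-byte schema and 20-byte datums.
one_per_file = schema_overhead(400, 20, 1)
many_per_file = schema_overhead(400, 20, 1000)
print(f"1 datum per file:     {one_per_file:.1%} of the file is schema")
print(f"1000 datums per file: {many_per_file:.1%} of the file is schema")
```

With one datum per file the schema dominates the file; with many datums
per file it becomes a small fraction, and the NameNode tracks far fewer
files.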

-Sean


On Mon, Mar 17, 2014 at 2:55 PM, Salman Haq <sh...@audaxhealth.com> wrote:

> Hello,
>
> I'd like to confirm if there is a recommended way to serialize data to a
> file but without the schema being written in the file metadata. Assume a
> reader's schema will be available for deserialization at a later time.
>
> My use case requires small-sized datum messages to be serialized and
> copied to HDFS. The presence of the schema in the message file adds
> considerable overhead relative to the size of the datum itself.
>
> Thank you,
> Shaq
>
>

Re:

Posted by Salman Haq <sh...@audaxhealth.com>.
Essentially, we are instrumenting distributed applications. The instrumented
message format is defined in an Avro schema. The messages are transported
over a message queue (e.g., RabbitMQ) or (eventually) over Flume and dumped
into HDFS, from where they are loaded into Hive for querying.

In HDFS we can certainly colocate the data into a small number of files.
But I want to know if we can minimize the network bandwidth by generating
valid messages on the client side without the schema in the header.
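
For a sense of what a schemaless message looks like on the wire, the
sketch below hand-encodes one record using Avro's binary encoding rules:
longs as zigzag base-128 varints, strings as a length followed by UTF-8
bytes, doubles as 8 little-endian bytes, and a record as its fields
concatenated in schema order. The record layout (name, timestamp, value)
and all function names are hypothetical. This is not the Avro library's
API; with the library, the equivalent is presumably to write through a
datum writer and a binary encoder rather than the data file writer.

```python
import struct


def encode_long(n: int) -> bytes:
    """Avro long: zigzag-map the signed value, then emit 7 bits per byte."""
    z = (n << 1) ^ (n >> 63)  # zigzag, valid for 64-bit signed range
    out = bytearray()
    while z & ~0x7F:
        out.append((z & 0x7F) | 0x80)  # high bit set: more bytes follow
        z >>= 7
    out.append(z)
    return bytes(out)


def encode_string(s: str) -> bytes:
    """Avro string: byte length as a long, followed by the UTF-8 bytes."""
    data = s.encode("utf-8")
    return encode_long(len(data)) + data


def encode_event(name: str, timestamp: int, value: float) -> bytes:
    """Hypothetical record {name: string, timestamp: long, value: double}.

    Fields are concatenated in schema order; no schema, field names, or
    type tags are written.
    """
    return encode_string(name) + encode_long(timestamp) + struct.pack("<d", value)


def read_long(buf: bytes, pos: int):
    """Decode one zigzag varint starting at pos; return (value, new_pos)."""
    shift = 0
    z = 0
    while True:
        b = buf[pos]
        pos += 1
        z |= (b & 0x7F) << shift
        if not b & 0x80:
            break
        shift += 7
    return (z >> 1) ^ -(z & 1), pos


def decode_event(buf: bytes):
    """Decode the hypothetical record; only possible if the schema is known."""
    length, pos = read_long(buf, 0)
    name = buf[pos:pos + length].decode("utf-8")
    pos += length
    timestamp, pos = read_long(buf, pos)
    (value,) = struct.unpack_from("<d", buf, pos)
    return name, timestamp, value


payload = encode_event("cpu", 1395090000, 0.5)
print(len(payload), "bytes on the wire")
print(decode_event(payload))
```

The payload carries no schema at all, so the reader must already have
the exact writer's schema (or a way to look it up) to decode it.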

Does that make sense?

Shaq

