Posted to user@avro.apache.org by Josh Spiegel <jo...@gmail.com> on 2013/01/23 22:09:45 UTC

Synchronization Markers

As I understand it, Avro container files contain synchronization markers
every so often to support splitting the file.  See:
https://cwiki.apache.org/AVRO/faq.html#FAQ-Whatisthepurposeofthesyncmarkerintheobjectfileformat%3F

(1) Why isn't the synchronization marker the same for every container
file?  (i.e., what is the point of generating it randomly every time?)

(2) Is it possible, at least in theory, for naturally occurring data to
contain bytes that match the sync marker? If so, would this break
synchronization?

Thanks,
Josh

Re: Synchronization Markers

Posted by Josh Spiegel <jo...@gmail.com>.
Ok, makes sense.  Thanks for the answer.


On Thu, Jan 24, 2013 at 4:47 AM, Martin Kleppmann <ma...@rapportive.com> wrote:

> 1. Because if it were predictable, it would inevitably appear in the
> actual data sometimes (e.g. imagine the Avro documentation, stating
> what the sync marker is, is downloaded by a web crawler and stored in
> an Avro data file; then the sync marker will appear in the actual
> data). Data may come from malicious sources; making the marker random
> makes it infeasible to exploit.
>
> 2. Possibly, but extremely unlikely. The probability of a given random
> 16-byte string appearing in a petabyte of (uniformly distributed) data
> is about 3 x 10^-24. It's more likely that your data center is wiped out
> by a meteorite (http://preshing.com/20110504/hash-collision-probabilities).
>
> 3. If the sync marker does appear in your data, it only breaks reading
> the file if you also happen to seek to a point just before it, so that
> the reader resynchronizes on the false marker. If you just read over it
> sequentially, nothing happens.
>
> Martin

Re: Synchronization Markers

Posted by Martin Kleppmann <ma...@rapportive.com>.
1. Because if it were predictable, it would inevitably appear in the
actual data sometimes (e.g. imagine the Avro documentation, stating
what the sync marker is, is downloaded by a web crawler and stored in
an Avro data file; then the sync marker will appear in the actual
data). Data may come from malicious sources; making the marker random
makes it infeasible to exploit.

2. Possibly, but extremely unlikely. The probability of a given random
16-byte string appearing in a petabyte of (uniformly distributed) data
is about 3 x 10^-24. It's more likely that your data center is wiped out
by a meteorite (http://preshing.com/20110504/hash-collision-probabilities).

3. If the sync marker does appear in your data, it only breaks reading
the file if you also happen to seek to a point just before it, so that
the reader resynchronizes on the false marker. If you just read over it
sequentially, nothing happens.
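
For anyone who wants to check the estimate in point 2, here is a rough
back-of-the-envelope sketch in Python. It assumes "a petabyte" means
10^15 bytes and that every byte is independent and uniformly random, and
it uses a simple union bound (number of start offsets times the chance of
a match at one offset). The function name is made up for illustration.

```python
from fractions import Fraction

MARKER_BYTES = 16        # sync marker length discussed in this thread
DATA_BYTES = 10**15      # assumption: "a petabyte" taken as 10^15 bytes


def marker_collision_probability(data_bytes: int, marker_bytes: int) -> float:
    """Union-bound estimate that a fixed marker appears somewhere in
    uniformly random data of the given size (hypothetical helper)."""
    # Number of offsets at which a marker-sized window can start.
    positions = data_bytes - marker_bytes + 1
    # Chance that one specific window equals the marker exactly.
    per_position = Fraction(1, 256 ** marker_bytes)
    return float(positions * per_position)


p = marker_collision_probability(DATA_BYTES, MARKER_BYTES)
print(f"{p:.2e}")  # on the order of 1e-24
```

Real data is not uniformly random, so this only gives the order of
magnitude, which is the point of the argument above.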

Martin
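
To illustrate point 3, here is a toy sketch in Python. It is NOT the real
Avro container encoding: it assumes a simplified block layout of a 4-byte
big-endian length, the payload, then the 16-byte marker, and all function
names are invented for the example. It shows that a length-driven
sequential read is unaffected by a payload that contains the marker, while
a reader that seeks into that payload and scans forward for the marker
lands on a false boundary.

```python
import os
import struct

SYNC_SIZE = 16  # marker length; the block layout below is a simplification


def write_blocks(payloads, marker):
    """Toy writer: [4-byte length][payload][marker] for each payload."""
    out = bytearray()
    for payload in payloads:
        out += struct.pack(">I", len(payload)) + payload + marker
    return bytes(out)


def read_sequential(data, marker):
    """Read every block using the length prefix; never scans for the marker."""
    payloads, pos = [], 0
    while pos < len(data):
        (size,) = struct.unpack_from(">I", data, pos)
        pos += 4
        payloads.append(data[pos:pos + size])
        pos += size
        if data[pos:pos + SYNC_SIZE] != marker:
            raise ValueError("sync marker mismatch at block end")
        pos += SYNC_SIZE
    return payloads


def resync(data, marker, offset):
    """Scan forward from an arbitrary offset; return the position just
    past the first marker found, or -1 if there is none."""
    hit = data.find(marker, offset)
    return -1 if hit < 0 else hit + SYNC_SIZE


marker = os.urandom(SYNC_SIZE)           # random per file, as in point 1
tricky = b"abc" + marker + b"xyz"        # payload that contains the marker
payloads = [b"first", tricky, b"last"]
data = write_blocks(payloads, marker)

# Sequential reading is driven by lengths, so the embedded marker is harmless.
assert read_sequential(data, marker) == payloads

# Seek to one byte inside the tricky payload, then resynchronize.
tricky_block_start = 4 + len(b"first") + SYNC_SIZE
seek_offset = tricky_block_start + 4 + 1
landed = resync(data, marker, seek_offset)

# The real start of the next block comes after the tricky block's own marker.
true_boundary = tricky_block_start + 4 + len(tricky) + SYNC_SIZE
print(landed, true_boundary)  # the two offsets differ: a false sync point
```

A reader that continued from the false sync point would interpret payload
bytes as a block header, which is the failure mode described above.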
