Posted to mapreduce-user@hadoop.apache.org by John Lilley <jo...@redpoint.net> on 2013/05/23 19:31:14 UTC

splittable vs seekable compressed formats

I've read about splittable compressed formats in Hadoop.  Are any of these formats also "seekable" (in other words, able to seek to an absolute location in the uncompressed data)?
John


RE: splittable vs seekable compressed formats

Posted by John Lilley <jo...@redpoint.net>.
More specifically, seeking to a known location in the uncompressed data.  So not just seeking to “the nearest record boundary”, but seeking to “position 100000000 in the uncompressed data”.  I can see that if the writer kept track of this information on the side it would be available; my question is more about the standard formats (e.g. LZO compression in SequenceFile) supporting this without additional work.
John
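
To make the "writer kept track of this information on the side" case concrete, here is a minimal sketch in Python. It is not a Hadoop API and not how any standard format works: it assumes a hypothetical block-compressed layout (zlib, fixed-size uncompressed blocks) and a side index of (uncompressed offset, compressed offset) pairs. The names write_blocks, read_at and BLOCK_SIZE are made up for illustration.

```python
import bisect
import zlib

BLOCK_SIZE = 64 * 1024  # uncompressed bytes per block (arbitrary choice)


def write_blocks(data, block_size=BLOCK_SIZE):
    """Compress data block by block and return (compressed_bytes, index).

    index holds one (uncompressed_offset, compressed_offset) pair per
    block; this is the information kept "on the side" by the writer.
    """
    out = bytearray()
    index = []
    for start in range(0, len(data), block_size):
        index.append((start, len(out)))
        out += zlib.compress(data[start:start + block_size])
    return bytes(out), index


def read_at(compressed, index, position, length):
    """Return `length` uncompressed bytes starting at absolute `position`."""
    starts = [u for u, _ in index]
    # Last block whose uncompressed start offset is <= position.
    i = bisect.bisect_right(starts, position) - 1
    result = bytearray()
    while i < len(index) and len(result) < length:
        u_off, c_off = index[i]
        c_end = index[i + 1][1] if i + 1 < len(index) else len(compressed)
        block = zlib.decompress(compressed[c_off:c_end])
        # Only the first block read needs a non-zero skip.
        skip = max(0, position - u_off)
        result += block[skip:skip + length - len(result)]
        i += 1
    return bytes(result)
```

With such an index, reading from "position 100000000 in the uncompressed data" costs one index lookup plus decompressing the block or blocks that cover the requested range. Without the index, the reader would have to decompress from the start of the file or guess.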

From: Rahul Bhattacharjee [mailto:rahul.rec.dgp@gmail.com]
Sent: Friday, May 24, 2013 1:00 AM
To: user@hadoop.apache.org
Subject: Re: splittable vs seekable compressed formats

Yeah, I think John meant seeking to record boundaries.
Thanks,
Rahul

On Fri, May 24, 2013 at 12:22 PM, Harsh J <ha...@cloudera.com> wrote:
SequenceFiles should be seekable provided you know/manage their sync
points during writes, I think. With LZO this may be non-trivial.

On Thu, May 23, 2013 at 11:01 PM, John Lilley <jo...@redpoint.net> wrote:
> I’ve read about splittable compressed formats in Hadoop.  Are any of these
> formats also “seekable” (in other words, able to seek to an absolute
> location in the uncompressed data)?
>
> John
>
>


--
Harsh J


Re: splittable vs seekable compressed formats

Posted by Rahul Bhattacharjee <ra...@gmail.com>.
Yeah, I think John meant seeking to record boundaries.

Thanks,
Rahul


On Fri, May 24, 2013 at 12:22 PM, Harsh J <ha...@cloudera.com> wrote:

> SequenceFiles should be seekable provided you know/manage their sync
> points during writes, I think. With LZO this may be non-trivial.
>
> On Thu, May 23, 2013 at 11:01 PM, John Lilley <jo...@redpoint.net> wrote:
> > I’ve read about splittable compressed formats in Hadoop.  Are any of these
> > formats also “seekable” (in other words, able to seek to an absolute
> > location in the uncompressed data)?
> >
> > John
> >
> >
>
>
>
> --
> Harsh J
>

Re: splittable vs seekable compressed formats

Posted by Harsh J <ha...@cloudera.com>.
SequenceFiles should be seekable provided you know/manage their sync
points during writes, I think. With LZO this may be non-trivial.

On Thu, May 23, 2013 at 11:01 PM, John Lilley <jo...@redpoint.net> wrote:
> I’ve read about splittable compressed formats in Hadoop.  Are any of these
> formats also “seekable” (in other words, able to seek to an absolute
> location in the uncompressed data)?
>
> John
>
>



-- 
Harsh J
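
As a rough illustration of what sync points buy you, here is a toy sketch in Python. It is not the actual SequenceFile on-disk layout or API: the format (a fixed marker written before every few length-prefixed records) and the names write_records and read_from are assumptions made for the example. It also assumes the marker never occurs inside record data.

```python
import struct


def write_records(records, sync, every=3):
    """Write length-prefixed records, emitting `sync` before every
    `every`-th record (toy layout, for illustration only)."""
    out = bytearray()
    for n, rec in enumerate(records):
        if n % every == 0:
            out += sync
        out += struct.pack(">I", len(rec)) + rec
    return bytes(out)


def read_from(buf, sync, offset):
    """Scan forward from an arbitrary byte offset to the next sync
    marker, then return every record from there to the end."""
    pos = buf.find(sync, offset)
    if pos < 0:
        return []  # no sync marker at or after this offset
    records = []
    while pos < len(buf):
        if buf.startswith(sync, pos):
            pos += len(sync)  # skip over the marker itself
            continue
        (n,) = struct.unpack_from(">I", buf, pos)
        pos += 4
        records.append(buf[pos:pos + n])
        pos += n
    return records
```

The reader can start from any byte offset in the file, but it lands on the next marker, that is, on a record boundary. Nothing in this scheme maps an offset in the uncompressed data to a place in the file unless the writer also records where each marker falls.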

Re: splittable vs seekable compressed formats

Posted by Rahul Bhattacharjee <ra...@gmail.com>.
I think seeking is a property of the file system, so any file stored in HDFS
is seekable. The input stream is seekable and the output stream isn't.
FileSystem supports Seekable.

Thanks,
Rahul


On Thu, May 23, 2013 at 11:01 PM, John Lilley <jo...@redpoint.net> wrote:

> I’ve read about splittable compressed formats in Hadoop.  Are any of
> these formats also “seekable” (in other words, able to seek to an
> absolute location in the uncompressed data)?
>
> John
>
