Posted to common-user@hadoop.apache.org by Mark <st...@gmail.com> on 2013/04/09 01:48:17 UTC

Best format to use

Trying to determine the best format to use for storing daily logs. We recently switched from snappy (.snappy) to gzip (.deflate) but I'm wondering if there is something better. Our main clients for these daily logs are Pig and Hive using an external table. We were thinking about testing out Impala but we see that it doesn't work with compressed text files. Any suggestions?

Thanks
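
Since the extension and the codec named above may not match, one way to confirm what a file really contains is to look at its first bytes rather than its name. The following is only a minimal sketch: sniff_compression is a hypothetical helper, and it assumes that .deflate files are zlib-wrapped streams, which should be verified against the cluster's codec settings.

```python
import gzip
import zlib


def sniff_compression(first_bytes: bytes) -> str:
    """Guess the compression container from the leading bytes of a file."""
    # gzip members always begin with the magic bytes 0x1f 0x8b.
    if first_bytes[:2] == b"\x1f\x8b":
        return "gzip"
    # A zlib stream begins with a two-byte header: the low nibble of the
    # first byte is 8 (deflate) and the 16-bit header is a multiple of 31.
    if (
        len(first_bytes) >= 2
        and first_bytes[0] & 0x0F == 8
        and ((first_bytes[0] << 8) | first_bytes[1]) % 31 == 0
    ):
        return "zlib/deflate"
    return "unknown"


# Made-up log line, used only to produce sample bytes.
sample = b"2013-04-09 01:48:17 GET /index.html 200\n" * 100
print(sniff_compression(gzip.compress(sample)))
print(sniff_compression(zlib.compress(sample)))
print(sniff_compression(sample))
```

In practice the first few bytes of a file on HDFS could be fetched with any client and passed to the helper; uncompressed text and other codecs such as Snappy would simply report "unknown" here.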

RE: Best format to use

Posted by Viraj Bhat <vi...@yahoo-inc.com>.
Pig supports the AvroStorage() UDF for both loading and storing; it currently resides in Piggybank:
http://svn.apache.org/repos/asf/pig/branches/branch-0.11/contrib/piggybank/java/src/main/java/org/apache/pig/piggybank/storage/avro/
There is also a version on GitHub which is currently being ported to trunk:
https://github.com/josephadler/fast-avro-storage
Regards
Viraj

From: Nitin Pawar [mailto:nitinpawar432@gmail.com]
Sent: Tuesday, April 09, 2013 12:00 PM
To: user@hadoop.apache.org
Subject: Re: Best format to use

Not sure about Pig or Impala, but in Hive you have this:
https://cwiki.apache.org/Hive/avroserde-working-with-avro-from-hive.html


On Wed, Apr 10, 2013 at 12:26 AM, Mark <st...@gmail.com> wrote:
Avro is pretty sweet, but is it supported by Hive, Pig and Impala? Is it splittable?

On Apr 9, 2013, at 10:58 AM, Roman Shaposhnik <rv...@apache.org> wrote:

> On Tue, Apr 9, 2013 at 9:50 AM, Mark <st...@gmail.com> wrote:
>> Forgetting Impala, what format would be best to use with daily logs?
>>
>> Block-compressed sequence files?
>
> I'd actually use avro encoded files.
>
> Thanks,
> Roman.



--
Nitin Pawar

Re: Best format to use

Posted by Nitin Pawar <ni...@gmail.com>.
Not sure about Pig or Impala, but in Hive you have this:
https://cwiki.apache.org/Hive/avroserde-working-with-avro-from-hive.html



On Wed, Apr 10, 2013 at 12:26 AM, Mark <st...@gmail.com> wrote:

> Avro is pretty sweet, but is it supported by Hive, Pig and Impala? Is it
> splittable?
>
> On Apr 9, 2013, at 10:58 AM, Roman Shaposhnik <rv...@apache.org> wrote:
>
> > On Tue, Apr 9, 2013 at 9:50 AM, Mark <st...@gmail.com> wrote:
> >> Forgetting Impala, what format would be best to use with daily logs?
> >>
> >> Block-compressed sequence files?
> >
> > I'd actually use avro encoded files.
> >
> > Thanks,
> > Roman.
>
>


-- 
Nitin Pawar

Re: Best format to use

Posted by Mark <st...@gmail.com>.
Avro is pretty sweet, but is it supported by Hive, Pig and Impala? Is it splittable?

On Apr 9, 2013, at 10:58 AM, Roman Shaposhnik <rv...@apache.org> wrote:

> On Tue, Apr 9, 2013 at 9:50 AM, Mark <st...@gmail.com> wrote:
>> Forgetting Impala, what format would be best to use with daily logs?
>> 
>> Block-compressed sequence files?
> 
> I'd actually use avro encoded files.
> 
> Thanks,
> Roman.


Re: Best format to use

Posted by Roman Shaposhnik <rv...@apache.org>.
On Tue, Apr 9, 2013 at 9:50 AM, Mark <st...@gmail.com> wrote:
> Forgetting Impala, what format would be best to use with daily logs?
>
> Block-compressed sequence files?

I'd actually use avro encoded files.

Thanks,
Roman.

Re: Best format to use

Posted by Nitin Pawar <ni...@gmail.com>.
There is another important point to look at as well.

What kind of queries are you planning to run? Different formats suit
different needs.
For example, if your queries touch only a couple of columns, then
you may choose RCFile (or the newer ORCFile format).
If you need to read a few records together, then sequence files may be
useful.

There are others as well.
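
To make the row-versus-column point concrete, here is a rough sketch of why a columnar layout helps when a query touches only a couple of columns. It is plain Python, not the RCFile or ORCFile format; the field names, the sample rows and the to_columns helper are all made up for illustration, and every row is assumed to carry the same fields.

```python
# Hypothetical log records, stored row by row.
rows = [
    {"ts": "2013-04-09T01:48:17", "status": 200, "bytes": 5120, "path": "/index.html"},
    {"ts": "2013-04-09T01:48:18", "status": 404, "bytes": 312, "path": "/missing"},
    {"ts": "2013-04-09T01:48:19", "status": 200, "bytes": 2048, "path": "/about.html"},
]


def to_columns(records):
    """Pivot row-oriented records into one list of values per column."""
    columns = {}
    for record in records:
        for name, value in record.items():
            columns.setdefault(name, []).append(value)
    return columns


columns = to_columns(rows)

# "Total bytes served with status 200" needs only two of the four columns;
# in a columnar file the other columns would not have to be read at all.
total = sum(b for s, b in zip(columns["status"], columns["bytes"]) if s == 200)
print(total)
```

With a row-oriented layout the same query has to walk every full record, which is fine when most fields are needed anyway.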


On Tue, Apr 9, 2013 at 10:21 PM, Mark <st...@gmail.com> wrote:

> Actually, compressed sequence files may not work with Pig or Hive then
> right?
>
> On Apr 9, 2013, at 9:50 AM, Mark <st...@gmail.com> wrote:
>
> > Forgetting Impala, what format would be best to use with daily logs?
> >
> > Block-compressed sequence files?
> >
> > On Apr 8, 2013, at 8:12 PM, Harsh J <ha...@cloudera.com> wrote:
> >
> >> Hey Mark,
> >>
> >> Gzip codec creates extension .gz, not .deflate (which is
> >> DeflateCodec). You may want to re-check your settings.
> >>
> >> Impala questions are best resolved at its current user and developer
> >> community at
> >> https://groups.google.com/a/cloudera.org/forum/#!forum/impala-user.
> >> Impala does currently support LZO (and also Indexed LZO) compressed
> >> text files however, so you may want to try that as it's splittable
> >> (compared to Gzip ones).
> >>
> >> On Tue, Apr 9, 2013 at 5:18 AM, Mark <st...@gmail.com> wrote:
> >>> Trying to determine the best format to use for storing daily
> >>> logs. We recently switched from snappy (.snappy) to gzip (.deflate) but I'm
> >>> wondering if there is something better. Our main clients for these daily
> >>> logs are Pig and Hive using an external table. We were thinking about
> >>> testing out Impala but we see that it doesn't work with compressed text
> >>> files. Any suggestions?
> >>>
> >>> Thanks
> >>
> >>
> >>
> >> --
> >> Harsh J
> >
>
>


-- 
Nitin Pawar

Re: Best format to use

Posted by Harsh J <ha...@cloudera.com>.
Pig and Hive both have support for compressed sequence files.

Regarding best format - if it's just text log data (i.e. no
types/structure) then the best format to keep it in is
text+compression. SequenceFiles help make it splittable but add a small
overhead in space and efficiency, and none of the good codecs out there
are splittable on their own (LZO is good, but needs
pre-indexing to be treated as splittable).
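
As a rough illustration of the splittability point, the sketch below compresses records in independent blocks, each preceded by a sync marker, so that a reader can start at any byte offset and pick up at the next marker. It only shows the idea and is not the actual SequenceFile layout; write_blocks, read_split, the 4-byte length prefix and the sample records are all assumptions made for the example. Records are assumed not to contain newlines.

```python
import os
import zlib

# Random 16-byte marker; a chance collision with compressed data is
# ignored in this sketch.
SYNC = os.urandom(16)


def write_blocks(records, sync, records_per_block=1000):
    """Compress records in independent blocks, each preceded by a sync marker."""
    out = bytearray()
    for i in range(0, len(records), records_per_block):
        payload = zlib.compress(b"\n".join(records[i:i + records_per_block]))
        out += sync + len(payload).to_bytes(4, "big") + payload
    return bytes(out)


def read_split(data, sync, start, end):
    """Return the records of every block whose sync marker begins in [start, end)."""
    records = []
    pos = data.find(sync, start)
    while pos != -1 and pos < end:
        size_at = pos + len(sync)
        size = int.from_bytes(data[size_at:size_at + 4], "big")
        payload = data[size_at + 4:size_at + 4 + size]
        records.extend(zlib.decompress(payload).split(b"\n"))
        pos = data.find(sync, size_at + 4 + size)
    return records


records = [b"line %d" % i for i in range(5000)]
data = write_blocks(records, SYNC)
mid = len(data) // 2

# Two "tasks" each take half of the byte range and read independently.
first = read_split(data, SYNC, 0, mid)
second = read_split(data, SYNC, mid, len(data))
assert first + second == records
```

A single gzip or deflate stream over the whole file has no such restart points, so a reader generally has to begin at byte 0, which is why plain compressed text ends up being processed by one task per file.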

On Tue, Apr 9, 2013 at 10:21 PM, Mark <st...@gmail.com> wrote:
> Actually, compressed sequence files may not work with Pig or Hive then right?
>
> On Apr 9, 2013, at 9:50 AM, Mark <st...@gmail.com> wrote:
>
>> Forgetting Impala, what format would be best to use with daily logs?
>>
>> Block-compressed sequence files?
>>
>> On Apr 8, 2013, at 8:12 PM, Harsh J <ha...@cloudera.com> wrote:
>>
>>> Hey Mark,
>>>
>>> Gzip codec creates extension .gz, not .deflate (which is
>>> DeflateCodec). You may want to re-check your settings.
>>>
>>> Impala questions are best resolved at its current user and developer
>>> community at https://groups.google.com/a/cloudera.org/forum/#!forum/impala-user.
>>> Impala does currently support LZO (and also Indexed LZO) compressed
>>> text files however, so you may want to try that as it's splittable
>>> (compared to Gzip ones).
>>>
>>> On Tue, Apr 9, 2013 at 5:18 AM, Mark <st...@gmail.com> wrote:
>>>> Trying to determine the best format to use for storing daily logs. We recently switched from snappy (.snappy) to gzip (.deflate) but I'm wondering if there is something better. Our main clients for these daily logs are Pig and Hive using an external table. We were thinking about testing out Impala but we see that it doesn't work with compressed text files. Any suggestions?
>>>>
>>>> Thanks
>>>
>>>
>>>
>>> --
>>> Harsh J
>>
>



-- 
Harsh J

Re: Best format to use

Posted by Mark <st...@gmail.com>.
Actually, compressed sequence files may not work with Pig or Hive then right?

On Apr 9, 2013, at 9:50 AM, Mark <st...@gmail.com> wrote:

> Forgetting Impala, what format would be best to use with daily logs? 
> 
> Block-compressed sequence files?
> 
> On Apr 8, 2013, at 8:12 PM, Harsh J <ha...@cloudera.com> wrote:
> 
>> Hey Mark,
>> 
>> Gzip codec creates extension .gz, not .deflate (which is
>> DeflateCodec). You may want to re-check your settings.
>> 
>> Impala questions are best resolved at its current user and developer
>> community at https://groups.google.com/a/cloudera.org/forum/#!forum/impala-user.
>> Impala does currently support LZO (and also Indexed LZO) compressed
>> text files however, so you may want to try that as it's splittable
>> (compared to Gzip ones).
>> 
>> On Tue, Apr 9, 2013 at 5:18 AM, Mark <st...@gmail.com> wrote:
>>> Trying to determine the best format to use for storing daily logs. We recently switched from snappy (.snappy) to gzip (.deflate) but I'm wondering if there is something better. Our main clients for these daily logs are Pig and Hive using an external table. We were thinking about testing out Impala but we see that it doesn't work with compressed text files. Any suggestions?
>>> 
>>> Thanks
>> 
>> 
>> 
>> -- 
>> Harsh J
> 


Re: Best format to use

Posted by Mark <st...@gmail.com>.
Forgetting Impala, what format would be best to use with daily logs? 

Block-compressed sequence files?

On Apr 8, 2013, at 8:12 PM, Harsh J <ha...@cloudera.com> wrote:

> Hey Mark,
> 
> Gzip codec creates extension .gz, not .deflate (which is
> DeflateCodec). You may want to re-check your settings.
> 
> Impala questions are best resolved at its current user and developer
> community at https://groups.google.com/a/cloudera.org/forum/#!forum/impala-user.
> Impala does currently support LZO (and also Indexed LZO) compressed
> text files however, so you may want to try that as it's splittable
> (compared to Gzip ones).
> 
> On Tue, Apr 9, 2013 at 5:18 AM, Mark <st...@gmail.com> wrote:
>> Trying to determine the best format to use for storing daily logs. We recently switched from snappy (.snappy) to gzip (.deflate) but I'm wondering if there is something better. Our main clients for these daily logs are Pig and Hive using an external table. We were thinking about testing out Impala but we see that it doesn't work with compressed text files. Any suggestions?
>> 
>> Thanks
> 
> 
> 
> -- 
> Harsh J


Re: Best format to use

Posted by Harsh J <ha...@cloudera.com>.
Hey Mark,

The Gzip codec creates files with the extension .gz, not .deflate
(which is what DefaultCodec/DeflateCodec produces). You may want to
re-check your settings.
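
One quick way to re-check is to look at the first bytes of an output file rather than trusting its extension. The sketch below is a rough check, assuming that .deflate output is a zlib-wrapped stream (my understanding of the default codec's behaviour, so verify against your own files). The function name is hypothetical and the sample data is generated in memory rather than read from HDFS.

```python
import gzip
import zlib

def sniff_compression(first_bytes: bytes) -> str:
    """Guess the container format from the leading bytes of a file."""
    # gzip streams always begin with the magic bytes 0x1f 0x8b.
    if first_bytes[:2] == b"\x1f\x8b":
        return "gzip"
    # A zlib header has compression method 8 (deflate) in the low nibble of
    # the first byte, and the first two bytes form a multiple of 31.
    if (
        len(first_bytes) >= 2
        and first_bytes[0] & 0x0F == 8
        and int.from_bytes(first_bytes[:2], "big") % 31 == 0
    ):
        return "zlib"
    return "unknown"

sample = b"2013-04-09 01:48:17 GET /index.html 200\n" * 100
print(sniff_compression(gzip.compress(sample)))
print(sniff_compression(zlib.compress(sample)))
print(sniff_compression(sample))
```

To use it on a real file, read the first few bytes (for example from a local copy fetched with hadoop fs -get) and pass them in.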

Impala questions are best resolved at its current user and developer
community at https://groups.google.com/a/cloudera.org/forum/#!forum/impala-user.
Impala does currently support LZO (and also indexed LZO) compressed
text files, however, so you may want to try that, as it's splittable
once indexed (unlike Gzip files).
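
To illustrate what "splittable" means here: a reader can only start in the middle of a compressed file if it knows where an independently decodable block begins. The sketch below is a simplified model of that idea, not the LZO index format. It uses zlib as a stand-in codec, and the helper names and in-memory index layout are assumptions made for illustration.

```python
import zlib

def write_blocks(lines, lines_per_block=1000):
    """Compress lines in independent blocks; return (data, index).

    The index holds one (offset, length) pair per block, which is the
    information a reader needs to start at any block boundary.
    """
    data = bytearray()
    index = []
    for start in range(0, len(lines), lines_per_block):
        chunk = "".join(lines[start:start + lines_per_block]).encode()
        block = zlib.compress(chunk)
        index.append((len(data), len(block)))
        data += block
    return bytes(data), index

def read_block(data, index, i):
    """Decompress only block i, without touching earlier blocks."""
    offset, length = index[i]
    block = zlib.decompress(data[offset:offset + length])
    return block.decode().splitlines(keepends=True)

def can_decompress_from(data, offset):
    """Report whether decompression succeeds when started at offset."""
    try:
        zlib.decompress(data[offset:])
        return True
    except zlib.error:
        return False

lines = [f"2013-04-09 request {n}\n" for n in range(5000)]

# One monolithic stream: a reader dropped into the middle cannot decode it.
whole = zlib.compress("".join(lines).encode())
print("start mid-stream works:", can_decompress_from(whole, len(whole) // 2))

# Indexed blocks: any block can be handed to a separate worker.
data, index = write_blocks(lines)
print("first line of block 3:", read_block(data, index, 3)[0].strip())
```

A single Gzip file behaves like the monolithic stream, so one task has to read it end to end, whereas indexed blocks let several tasks work on the same file in parallel.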

On Tue, Apr 9, 2013 at 5:18 AM, Mark <st...@gmail.com> wrote:
> Trying to determine the best format to use for storing daily logs. We recently switched from snappy (.snappy) to gzip (.deflate) but I'm wondering if there is something better. Our main clients for these daily logs are Pig and Hive using an external table. We were thinking about testing out Impala but we see that it doesn't work with compressed text files. Any suggestions?
>
> Thanks



-- 
Harsh J

Re: Best format to use

Posted by Azuryy Yu <az...@gmail.com>.
Impala can work with compressed files, but they have to be compressed
sequence files, not directly compressed text files.


On Tue, Apr 9, 2013 at 7:48 AM, Mark <st...@gmail.com> wrote:

> Trying to determine the best format to use for storing daily logs. We
> recently switched from snappy (.snappy) to gzip (.deflate) but I'm wondering
> if there is something better. Our main clients for these daily logs are Pig
> and Hive using an external table. We were thinking about testing out Impala
> but we see that it doesn't work with compressed text files. Any suggestions?
>
> Thanks
