Posted to user@hive.apache.org by Michael Jiang <it...@gmail.com> on 2011/03/30 20:55:35 UTC

how to convert single line into multiple lines in a serde (txt in txt out)?

I want to extend RegexSerDe to parse an Apache web log: each log entry needs
to be converted into multiple entries. This is easy with streaming, but I am
new to SerDes and wonder whether it is doable and how. Thanks!

Re: how to convert single line into multiple lines in a serde (txt in txt out)?

Posted by Michael Jiang <it...@gmail.com>.
Thanks Edward. Do you mean implementing "deserialize" to return a
list<struct>? What is explode()? Sorry for the basic questions. Could you
please elaborate on this a bit more or give me a link to some reference?
Thanks!

On Wed, Mar 30, 2011 at 12:03 PM, Edward Capriolo <ed...@gmail.com> wrote:

> On Wed, Mar 30, 2011 at 2:55 PM, Michael Jiang <it...@gmail.com> wrote:
> > Want to extend RegexSerDe to parse apache web log: for each log entry, need
> > to convert it into multiple entries. This is easy in streaming. But new to
> > serde, wondering if it is doable and how? Thanks!
> >
>
> You can have your serde produce list<struct> and then explode() them.
>

Re: how to convert single line into multiple lines in a serde (txt in txt out)?

Posted by Michael Jiang <it...@gmail.com>.
Thanks Edward. That'll work.

But that also means two tables will be created. What if we want only one
table, using a SerDe that reads the Apache web log and generates multiple
rows for each log line, loaded directly into the target table? Is that doable
by customizing RegexSerDe? That is, "create external table A (...) row format
serde 'serdeclass' with serdeproperties (...) stored as textfile location
'pathtoapachelog';" would give a table with the right fields extracted and
multiple rows generated, ready to query later.

If I cannot create such a SerDe without creating a second table for the task,
I think streaming is the better choice from a source code management
standpoint: a SerDe requires managing more libraries (hadoop, hive ...) for
the build.

Thanks!

On Wed, Mar 30, 2011 at 1:16 PM, Edward Capriolo <ed...@gmail.com> wrote:

> On Wed, Mar 30, 2011 at 3:46 PM, Michael Jiang <it...@gmail.com> wrote:
> > Also what if I want just one step to load each log entry line from log file
> > and for each generate multiple lines? That is, just one table created. I
> > don't want to have one table and then call explode() to get multiple lines.
> > Otherwise, alternative way is to use streaming on loaded table to turn it
> > into another one with no need to customize a serde. So, yeah, the goal here
> > is to see how a serde can do this stuff.
> >
> > Thanks!
> >
> > On Wed, Mar 30, 2011 at 12:03 PM, Edward Capriolo <edlinuxguru@gmail.com>
> > wrote:
> >>
> >> On Wed, Mar 30, 2011 at 2:55 PM, Michael Jiang <it...@gmail.com>
> >> wrote:
> >> > Want to extend RegexSerDe to parse apache web log: for each log entry,
> >> > need to convert it into multiple entries. This is easy in streaming. But
> >> > new to serde, wondering if it is doable and how? Thanks!
> >> >
> >>
> >> You can have your serde produce list<struct> and then explode() them.
> >
> >
>
> The role of a SerDe is to take the output from the InputFormat and use
> the information in the metastore to decode it. As a result, it is not
> a good place to turn a single row into multiple rows.
>
> What I am suggesting is to define a column like this:
>
> create table ... (id int, log_entries array<string>) row format serde ...
>
> Make sure your serde decodes and populates log_entries.
>
> From there you can use lateral view and explode
> http://wiki.apache.org/hadoop/Hive/LanguageManual/LateralView to turn
> the array<string> into rows.
>
>
> Edward
>

Re: how to convert single line into multiple lines in a serde (txt in txt out)?

Posted by Edward Capriolo <ed...@gmail.com>.
On Wed, Mar 30, 2011 at 3:46 PM, Michael Jiang <it...@gmail.com> wrote:
> Also what if I want just one step to load each log entry line from log file
> and for each generate multiple lines? That is, just one table created. I
> don't want to have one table and then call explode() to get multiple lines.
> Otherwise, alternative way is to use streaming on loaded table to turn it
> into another one with no need to customize a serde. So, yeah, the goal here
> is to see how a serde can do this stuff.
>
> Thanks!
>
> On Wed, Mar 30, 2011 at 12:03 PM, Edward Capriolo <ed...@gmail.com> wrote:
>>
>> On Wed, Mar 30, 2011 at 2:55 PM, Michael Jiang <it...@gmail.com> wrote:
>> > Want to extend RegexSerDe to parse apache web log: for each log entry, need
>> > to convert it into multiple entries. This is easy in streaming. But new to
>> > serde, wondering if it is doable and how? Thanks!
>> >
>>
>> You can have your serde produce list<struct> and then explode() them.
>
>

The role of a SerDe is to take the output from the InputFormat and use
the information in the metastore to decode it. As a result, it is not
a good place to turn a single row into multiple rows.

What I am suggesting is to define a column like this:

create table ... (id int, log_entries array<string>) row format serde ...

Make sure your serde decodes and populates log_entries.

From there you can use lateral view and explode
http://wiki.apache.org/hadoop/Hive/LanguageManual/LateralView to turn
the array<string> into rows.


Edward
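
The two steps described above (the serde fills a list column per log line,
then explode turns the list into rows) can be sketched in plain Java. This is
only an illustration and uses no Hive classes: the class and method names, the
regular expression, and the rule that the "multiple entries" are the query
string parameters of the request are all assumptions, since the thread does
not say what the entries are. The explode() method here is a stand-in for
what LATERAL VIEW explode() does, not Hive's implementation.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Hive: shows the kind of logic a custom
// deserialize() could run to fill the log_entries column for one log line.
class LogEntrySplitter {
    // Assumed pattern for a common-log-format line; captures the request URL.
    // Real logs may need a stricter or different expression.
    private static final Pattern LINE = Pattern.compile(
        "^\\S+ \\S+ \\S+ \\[[^\\]]*\\] \"\\S+ (\\S+)[^\"]*\"");

    // Assumed splitting rule: one entry per query-string parameter.
    static List<String> splitEntries(String logLine) {
        List<String> entries = new ArrayList<>();
        Matcher m = LINE.matcher(logLine);
        if (!m.find()) {
            return entries; // unparseable line: empty list, so no rows later
        }
        String url = m.group(1);
        int q = url.indexOf('?');
        if (q < 0) {
            return entries; // no query string, nothing to split
        }
        for (String pair : url.substring(q + 1).split("&")) {
            if (!pair.isEmpty()) {
                entries.add(pair);
            }
        }
        return entries;
    }

    // Plain-Java stand-in for LATERAL VIEW explode(): one row per element,
    // each row repeating the id of the line it came from.
    static List<String[]> explode(int id, List<String> entries) {
        List<String[]> rows = new ArrayList<>();
        for (String entry : entries) {
            rows.add(new String[] {Integer.toString(id), entry});
        }
        return rows;
    }

    public static void main(String[] args) {
        String line = "127.0.0.1 - - [30/Mar/2011:20:55:35 +0000] "
            + "\"GET /search?q=hive&page=2 HTTP/1.1\" 200 512";
        for (String[] row : explode(1, splitEntries(line))) {
            System.out.println(row[0] + "\t" + row[1]);
        }
    }
}
```

The point of the split is that the serde still returns exactly one row per
input line (with a list inside it), and the fan-out to several rows happens
afterwards at query time.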

Re: how to convert single line into multiple lines in a serde (txt in txt out)?

Posted by Michael Jiang <it...@gmail.com>.
Also, what if I want just one step that loads each log line from the log file
and generates multiple lines for each? That is, only one table is created. I
don't want to have one table and then call explode() to get multiple lines.
Otherwise, the alternative is to use streaming on the loaded table to turn it
into another one, with no need to customize a SerDe. So, yeah, the goal here
is to see how a SerDe can do this.

Thanks!

On Wed, Mar 30, 2011 at 12:03 PM, Edward Capriolo <ed...@gmail.com> wrote:

> On Wed, Mar 30, 2011 at 2:55 PM, Michael Jiang <it...@gmail.com> wrote:
> > Want to extend RegexSerDe to parse apache web log: for each log entry, need
> > to convert it into multiple entries. This is easy in streaming. But new to
> > serde, wondering if it is doable and how? Thanks!
> >
>
> You can have your serde produce list<struct> and then explode() them.
>

Re: how to convert single line into multiple lines in a serde (txt in txt out)?

Posted by Edward Capriolo <ed...@gmail.com>.
On Wed, Mar 30, 2011 at 2:55 PM, Michael Jiang <it...@gmail.com> wrote:
> Want to extend RegexSerDe to parse apache web log: for each log entry, need
> to convert it into multiple entries. This is easy in streaming. But new to
> serde, wondering if it is doable and how? Thanks!
>

You can have your serde produce list<struct> and then explode() them.