Posted to user@pig.apache.org by Ron Wurzberger <hm...@yahoo.com.INVALID> on 2014/07/18 22:06:38 UTC

How do I use Pig to break down a string blob

I am processing a data source containing log files. Each record contains a number of fields, but one field has a string value that is a dump of the log records for a given day. The log entries contained within this string "blob" are not fixed length, but they do follow a set pattern. I can break this blob down into individual log entries fairly easily using Java. The basic pattern is [<date>][<component>][<msgtype>] <message text>, and a given blob may contain up to 100 such entries. Can this be broken down using Pig? I want to output only the entries that have specific message types.

Can someone point me to any examples where Pig is used to break down a string into substrings based on a pattern?
If I create a UDF using Java to break down the string into substrings, can I return the substrings as a list to Pig?
    If so, how do I iterate through the list in Pig?

Ron W.

Re: How do I use Pig to break down a string blob

Posted by Cheolsoo Park <pi...@gmail.com>.
1) Can someone point me to any examples where Pig is used to break down a
string into substrings based on a pattern?
REGEX_EXTRACT_ALL
<http://pig.apache.org/docs/r0.13.0/func.html#regex-extract-all> might
help. There are also a couple of other built-in UDFs that might meet your
needs.

2) If I create a UDF using Java to break down the string into substrings,
can I return the substrings as a list to Pig?
You can return substrings as fields in a tuple, or as a bag.

3) If so, how do I iterate through the list in Pig?
What are you trying to achieve? You could iterate over the substrings within
the UDF from (2) before returning. If your UDF returns a tuple, you could
process each field in a foreach. If it returns a bag, you could
flatten it and process the entries row by row.
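
For the Java side, here is a minimal sketch of the splitting step only. It
is plain Java with no Pig dependency, so the wrapping into a UDF that
returns a bag is not shown. The class and method names (LogBlobSplitter,
split, withType) and the sample blob are made up for illustration, and the
regex assumes every entry starts with exactly three bracketed fields as in
the pattern described in the question. Each String[] holds the four parts
of one entry, which is roughly what one tuple in a returned bag would carry.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Pig: splits a log blob into entries.
class LogBlobSplitter {
    // Assumed entry header: [<date>][<component>][<msgtype>], captured as groups 1-3.
    private static final String HEADER =
        "\\[([^\\]]*)\\]\\[([^\\]]*)\\]\\[([^\\]]*)\\]";
    // Same header without capture groups, used only to detect where the next entry starts.
    private static final String NEXT_HEADER =
        "\\[[^\\]]*\\]\\[[^\\]]*\\]\\[[^\\]]*\\]";
    // The message text (group 4) is matched lazily up to the next header or the end of the blob.
    private static final Pattern ENTRY = Pattern.compile(
        HEADER + "\\s*(.*?)(?=\\s*" + NEXT_HEADER + "|\\z)", Pattern.DOTALL);

    // Returns one String[] {date, component, msgtype, message} per entry found.
    static List<String[]> split(String blob) {
        List<String[]> entries = new ArrayList<>();
        if (blob == null) {
            return entries;
        }
        Matcher m = ENTRY.matcher(blob);
        while (m.find()) {
            entries.add(new String[] {
                m.group(1), m.group(2), m.group(3), m.group(4).trim()
            });
        }
        return entries;
    }

    // Keeps only the entries whose msgtype field equals the requested type.
    static List<String[]> withType(List<String[]> entries, String msgType) {
        List<String[]> kept = new ArrayList<>();
        for (String[] e : entries) {
            if (e[2].equals(msgType)) {
                kept.add(e);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // Invented sample data following the assumed pattern.
        String blob = "[2014-07-18][auth][ERROR] login failed "
                    + "[2014-07-18][db][INFO] connection ok\n"
                    + "[2014-07-18][auth][ERROR] token expired";
        for (String[] e : withType(split(blob), "ERROR")) {
            System.out.println(String.join(" | ", e));
        }
    }
}
```

If the filtering is done in Pig instead, the withType step would be
dropped and the msgtype field filtered after flattening the bag.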

Hope this helps.


