Posted to common-user@hadoop.apache.org by praveenesh kumar <pr...@gmail.com> on 2011/12/27 10:24:58 UTC

Custom input format for parsing text files

Hey people,

I have a plain text file. I want to parse it using M/R line by line. By
"line" I mean a stretch of plain text that ends with a DOT.
Can I use M/R for this kind of job? I know that to do it this way
I would have to write my own InputFormat.
Can someone guide me or share their experience with this kind of problem?

For better context, suppose my file looks like this:

"Depending on your data processing needs, your Hadoop workload can vary
widely
over time. You may have a few large data processing jobs that occasionally
take advantage
of hundreds of nodes, but those same nodes will sit idle the rest of the
time.
You may be new to Hadoop and want to get familiar with it first before
investing in
a dedicated cluster. You may own a startup that needs to conserve cash and
wants
to avoid the capital expense of a Hadoop cluster."

I want to produce K-V pairs like this:

K1 - V1 --> Depending on your data processing needs, your Hadoop workload
can vary widely over time.
K2 - V2 --> You may have a few large data processing jobs that occasionally
take advantage of hundreds of nodes, but those same nodes will sit idle the
rest of the time.
K3 - V3 --> You may be new to Hadoop and want to get familiar with it first
before investing in a dedicated cluster.
K4 - V4 --> You may own a startup that needs to conserve cash and wants to
avoid the capital expense of a Hadoop cluster.

Thanks.
Praveenesh

Re: Custom input format for parsing text files

Posted by Harsh J <ha...@cloudera.com>.

This looks like a sentence extractor? How do you account for periods that are quoted? Your problem is much like a CSV file with commas inside its quoted values. Perhaps OpenCSV with delimiters configured may help you greatly as you write your own InputFormat.
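
To make the quoted-period concern concrete, here is a minimal sketch in plain Java of splitting on periods only when they fall outside double quotes. The class QuoteAwareSentences and its methods are hypothetical names made up for illustration; they are not part of Hadoop or OpenCSV, and a real custom RecordReader would also have to deal with input split boundaries, which this ignores.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration only; not a Hadoop or OpenCSV class.
final class QuoteAwareSentences {

    // Splits text at every period that sits outside a pair of double quotes.
    static List<String> split(String text) {
        List<String> sentences = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        boolean inQuotes = false;
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c == '"') {
                inQuotes = !inQuotes; // toggle on every double quote
            }
            current.append(c);
            if (c == '.' && !inQuotes) {
                addIfNotEmpty(sentences, current);
                current.setLength(0);
            }
        }
        addIfNotEmpty(sentences, current); // trailing text with no final period
        return sentences;
    }

    // Collapses line breaks and runs of whitespace, then drops empty results.
    private static void addIfNotEmpty(List<String> out, StringBuilder sb) {
        String s = sb.toString().replaceAll("\\s+", " ").trim();
        if (!s.isEmpty()) {
            out.add(s);
        }
    }

    public static void main(String[] args) {
        String text = "He said \"Wait. Stop.\" and left.\nThen it\nrained.";
        for (String sentence : split(text)) {
            System.out.println(sentence);
        }
    }
}
```

The periods inside the quoted span do not end a record here, which is the same idea as a CSV parser ignoring commas inside quoted values.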

If the above does not concern you, you may simply use the TextInputFormat with a custom record delimiter.
See https://issues.apache.org/jira/browse/MAPREDUCE-2254 [It looks to be available in Apache Hadoop 0.23+ already, and a ready-to-backport patch for 0.20/1.0 is also available in the cloudera/patches directory of any recent CDH3 source tarball]
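
As a rough sketch of what the custom-delimiter route gives you: to my knowledge the property involved is textinputformat.record.delimiter, set on the job Configuration, but please verify that against the version you run. The plain Java below only emulates the resulting records, so it runs without a cluster. DelimitedRecords, records and clean are hypothetical names for illustration. It assumes the delimiter itself is not part of the value and that the original line breaks stay inside each record, so the mapper has to tidy them up.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical emulation for illustration only; not Hadoop code.
final class DelimitedRecords {

    // In a real job the wiring would be roughly (assumption, needs MAPREDUCE-2254):
    //   conf.set("textinputformat.record.delimiter", ".");
    //   job.setInputFormatClass(TextInputFormat.class);

    // Emulates the records: key = offset where the record starts,
    // value = text up to (not including) the next delimiter.
    static Map<Long, String> records(String text, String delimiter) {
        if (delimiter.isEmpty()) {
            throw new IllegalArgumentException("delimiter must not be empty");
        }
        Map<Long, String> out = new LinkedHashMap<>();
        int start = 0;
        while (start < text.length()) {
            int end = text.indexOf(delimiter, start);
            if (end < 0) {
                end = text.length(); // last record has no closing delimiter
            }
            out.put((long) start, text.substring(start, end));
            start = end + delimiter.length();
        }
        return out;
    }

    // What a mapper might do with each value: collapse the embedded line
    // breaks and put the period back. Returns "" for whitespace-only records.
    static String clean(String value) {
        String s = value.replaceAll("\\s+", " ").trim();
        return s.isEmpty() ? s : s + ".";
    }

    public static void main(String[] args) {
        String text = "Your workload can vary\nwidely\nover time. You may be new.\n";
        for (Map.Entry<Long, String> e : records(text, ".").entrySet()) {
            String sentence = clean(e.getValue());
            if (!sentence.isEmpty()) { // skip the trailing newline-only record
                System.out.println(e.getKey() + " -> " + sentence);
            }
        }
    }
}
```

Note that the keys are offsets into the file rather than K1, K2, and so on; if you need sequential sentence numbers you would have to assign them yourself downstream.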
