Posted to common-user@hadoop.apache.org by Vipul Sharma <sh...@gmail.com> on 2009/11/03 00:01:38 UTC

XML input to map function

I am working on a MapReduce application that will take input from lots of
small XML files rather than one big XML file. Each XML file has some records
that I want to parse and load into an HBase table. How should I go about
parsing the XML files and feeding them to the map functions? Should I have
one mapper per XML file, or is there another way of doing this? Thanks for
your help and time.

Regards,
Vipul Sharma,

Re: XML input to map function

Posted by Amandeep Khurana <am...@gmail.com>.
On Mon, Nov 2, 2009 at 4:39 PM, Vipul Sharma <sh...@gmail.com> wrote:

> Okay, I think I was not clear in my first post about the question. Let me
> try again.
>
> I have an application that gets a large number of XML files every minute,
> which are copied over to HDFS. Each file is around 1 MB and contains
> several records. The files are well-formed XML, with a starting tag
> <startingtag> and an end tag </startingtag> in each file. I want to parse
> these files and put the relevant output data in HBase.
>
> Now, as input to the map function, I can read all the unread files into a
> string and parse them inside the map function using DOM or something like
> that. But then how do I deal with the multiple starting tags <startingtag>
> and ending tags </startingtag> in the string, since we concatenated several
> files together? And how do I manage splits, since Hadoop would want to
> split at its default setting, which might break the well-formed structure
> of the XML files?
>
>
So you have multiple XMLs in a single file, and you have many such files.
In that case, the best answer is the StreamXmlRecordReader.
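
Roughly, the wiring from a Java driver would look something like this
(untested sketch; StreamInputFormat and StreamXmlRecordReader live in the
hadoop-streaming jar, so that jar has to be on the job classpath, and the
<startingtag> delimiters are taken from your earlier mail):

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.lib.IdentityMapper;
import org.apache.hadoop.streaming.StreamInputFormat;

public class XmlJobDriver {
  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(XmlJobDriver.class);
    conf.setJobName("xml-to-hbase");

    // StreamInputFormat hands everything between begin and end to one map()
    // call as the key, so a split never cuts an XML record in half.
    conf.set("stream.recordreader.class",
             "org.apache.hadoop.streaming.StreamXmlRecordReader");
    conf.set("stream.recordreader.begin", "<startingtag>");
    conf.set("stream.recordreader.end", "</startingtag>");
    conf.setInputFormat(StreamInputFormat.class);

    // Swap IdentityMapper for your own parsing mapper.
    conf.setMapperClass(IdentityMapper.class);
    conf.setNumReduceTasks(0);             // map-only: each record is parsed and stored
    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(Text.class);

    FileInputFormat.setInputPaths(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));
    JobClient.runJob(conf);
  }
}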

Or you can write your own InputFormat to create splits such that each split
is an XML file in itself, or each record in a split is a complete XML
message.
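
If you go that route, the InputFormat can stay very small, since your files
are only ~1 MB each. Here is an untested sketch (the class names are made up):
mark the files non-splittable and hand each file to the mapper as one Text
value, so the XML always arrives intact.

import java.io.IOException;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.*;

public class WholeFileInputFormat extends FileInputFormat<NullWritable, Text> {

  @Override
  protected boolean isSplitable(FileSystem fs, Path file) {
    return false;  // never break a file: the XML must stay well formed
  }

  @Override
  public RecordReader<NullWritable, Text> getRecordReader(
      InputSplit split, JobConf conf, Reporter reporter) throws IOException {
    return new WholeFileRecordReader((FileSplit) split, conf);
  }

  static class WholeFileRecordReader implements RecordReader<NullWritable, Text> {
    private final FileSplit split;
    private final JobConf conf;
    private boolean done = false;

    WholeFileRecordReader(FileSplit split, JobConf conf) {
      this.split = split;
      this.conf = conf;
    }

    // Emits exactly one record: the entire contents of the file.
    public boolean next(NullWritable key, Text value) throws IOException {
      if (done) return false;
      byte[] contents = new byte[(int) split.getLength()];
      Path file = split.getPath();
      FSDataInputStream in = file.getFileSystem(conf).open(file);
      try {
        IOUtils.readFully(in, contents, 0, contents.length);
      } finally {
        IOUtils.closeStream(in);
      }
      value.set(contents, 0, contents.length);  // the whole XML document
      done = true;
      return true;
    }

    public NullWritable createKey() { return NullWritable.get(); }
    public Text createValue() { return new Text(); }
    public long getPos() { return done ? split.getLength() : 0; }
    public float getProgress() { return done ? 1.0f : 0.0f; }
    public void close() { }
  }
}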

> Another way to go about it would be to have a for loop in the driver class
> and provide one file at a time. I don't think that is a good way, since the
> files are very small and we would get almost no parallelization.
>
> Is there a way that I can input a list or array of files to the map
> function and do the parsing inside the map function? How would I take care
> of splits and the XML tags if I do that?
>
> I hope I was clearer this time?
>
> Regards,
> Vipul Sharma,
> Cell: 281-217-0761
>
>
> On Mon, Nov 2, 2009 at 4:00 PM, Amandeep Khurana <am...@gmail.com> wrote:
>
> > Are the XMLs in flat files or stored in HBase?
> >
> > 1. If they are in flat files, you can use the StreamXmlRecordReader if
> > that works for you.
> >
> > 2. Or you can read the XML into a single string and process it however
> > you want. (This can be done whether it is in a flat file or stored in an
> > HBase table.) I have XMLs in an HBase table and parse and process them
> > as strings.
> >
> > One mapper per file doesn't make sense. If the data is in HBase, have one
> > mapper per region. If they are flat files, you can decide how many mappers
> > to create based on how many files you have. You can tune this for your
> > particular requirement; there is no "right" way to do it.
> >
> > On Mon, Nov 2, 2009 at 3:01 PM, Vipul Sharma <sh...@gmail.com>
> > wrote:
> >
> > > I am working on a MapReduce application that will take input from lots
> > > of small XML files rather than one big XML file. Each XML file has some
> > > records that I want to parse and load into an HBase table. How should I
> > > go about parsing the XML files and feeding them to the map functions?
> > > Should I have one mapper per XML file, or is there another way of doing
> > > this? Thanks for your help and time.
> > >
> > > Regards,
> > > Vipul Sharma,
> > >
> >
>

Re: XML input to map function

Posted by Vipul Sharma <sh...@gmail.com>.
Okay, I think I was not clear in my first post about the question. Let me try
again.

I have an application that gets a large number of XML files every minute,
which are copied over to HDFS. Each file is around 1 MB and contains several
records. The files are well-formed XML, with a starting tag <startingtag> and
an end tag </startingtag> in each file. I want to parse these files and put
the relevant output data in HBase.

Now, as input to the map function, I can read all the unread files into a
string and parse them inside the map function using DOM or something like
that. But then how do I deal with the multiple starting tags <startingtag>
and ending tags </startingtag> in the string, since we concatenated several
files together? And how do I manage splits, since Hadoop would want to split
at its default setting, which might break the well-formed structure of the
XML files?

Another way to go about it would be to have a for loop in the driver class
and provide one file at a time. I don't think that is a good way, since the
files are very small and we would get almost no parallelization.

Is there a way that I can input a list or array of files to the map function
and do the parsing inside the map function? How would I take care of splits
and the XML tags if I do that?

I hope I was clearer this time?

Regards,
Vipul Sharma,
Cell: 281-217-0761


On Mon, Nov 2, 2009 at 4:00 PM, Amandeep Khurana <am...@gmail.com> wrote:

> Are the XMLs in flat files or stored in HBase?
>
> 1. If they are in flat files, you can use the StreamXmlRecordReader if that
> works for you.
>
> 2. Or you can read the XML into a single string and process it however you
> want. (This can be done whether it is in a flat file or stored in an HBase
> table.) I have XMLs in an HBase table and parse and process them as strings.
>
> One mapper per file doesn't make sense. If the data is in HBase, have one
> mapper per region. If they are flat files, you can decide how many mappers
> to create based on how many files you have. You can tune this for your
> particular requirement; there is no "right" way to do it.
>
> On Mon, Nov 2, 2009 at 3:01 PM, Vipul Sharma <sh...@gmail.com>
> wrote:
>
> > I am working on a MapReduce application that will take input from lots of
> > small XML files rather than one big XML file. Each XML file has some
> > records that I want to parse and load into an HBase table. How should I
> > go about parsing the XML files and feeding them to the map functions?
> > Should I have one mapper per XML file, or is there another way of doing
> > this? Thanks for your help and time.
> >
> > Regards,
> > Vipul Sharma,
> >
>

Re: XML input to map function

Posted by Amandeep Khurana <am...@gmail.com>.
Are the XMLs in flat files or stored in HBase?

1. If they are in flat files, you can use the StreamXmlRecordReader if that
works for you.

2. Or you can read the XML into a single string and process it however you
want. (This can be done whether it is in a flat file or stored in an HBase
table.) I have XMLs in an HBase table and parse and process them as strings.
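
As a rough illustration of option 2 (untested, and the element names "record"
and "id" are just placeholders for whatever your schema looks like), a mapper
that receives one whole XML document per call can parse it with the JDK's DOM
classes; the input key/value types depend on whichever InputFormat feeds it.

import java.io.ByteArrayInputStream;
import java.io.IOException;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

public class XmlRecordMapper extends MapReduceBase
    implements Mapper<NullWritable, Text, Text, Text> {

  public void map(NullWritable key, Text value,
                  OutputCollector<Text, Text> output, Reporter reporter)
      throws IOException {
    try {
      DocumentBuilder builder =
          DocumentBuilderFactory.newInstance().newDocumentBuilder();
      Document doc = builder.parse(
          new ByteArrayInputStream(value.toString().getBytes("UTF-8")));

      // Pull every <record> out of the document and emit its id and text;
      // a reducer or an HBase client call could then load these as rows.
      NodeList records = doc.getElementsByTagName("record");
      for (int i = 0; i < records.getLength(); i++) {
        Element record = (Element) records.item(i);
        output.collect(new Text(record.getAttribute("id")),
                       new Text(record.getTextContent()));
      }
    } catch (Exception e) {
      reporter.incrCounter("xml", "parse_errors", 1);  // skip malformed input
    }
  }
}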

One mapper per file doesn't make sense. If the data is in HBase, have one
mapper per region. If they are flat files, you can decide how many mappers to
create based on how many files you have. You can tune this for your
particular requirement; there is no "right" way to do it.

On Mon, Nov 2, 2009 at 3:01 PM, Vipul Sharma <sh...@gmail.com> wrote:

> I am working on a MapReduce application that will take input from lots of
> small XML files rather than one big XML file. Each XML file has some records
> that I want to parse and load into an HBase table. How should I go about
> parsing the XML files and feeding them to the map functions? Should I have
> one mapper per XML file, or is there another way of doing this? Thanks for
> your help and time.
>
> Regards,
> Vipul Sharma,
>