Posted to common-user@hadoop.apache.org by Alexandre Jaquet <al...@gmail.com> on 2009/06/12 22:31:11 UTC

parsing open xml

Hi,

Will Hadoop and MapReduce allow me to parse a large quantity of open XML
files stored in the same filesystem, using multiple jobs?

Thx

Alexandre Jaquet

Re: parsing open xml

Posted by Alexandre Jaquet <al...@gmail.com>.

Hi Alex,

First, thanks again for responding. I saw that Katta already supports
full-text search in its search engine, using PDFBox to index and search
PDF files ;) I will study your training videos tonight to learn how to
implement the job for XML :))

2009/6/15 Alex Loddengaard <al...@cloudera.com>

> Well, you define what your job does, but I expect that nearly all MR jobs do
> their parsing in the mapper, not in the reducer.  You may find these two
> videos useful:
>
> <http://www.cloudera.com/hadoop-training-mapreduce-hdfs>
> <http://www.cloudera.com/hadoop-training-programming-with-hadoop>
>
> Hope this helps!
>
> Alex

Re: parsing open xml

Posted by Alex Loddengaard <al...@cloudera.com>.

Well, you define what your job does, but I expect that nearly all MR jobs do
their parsing in the mapper, not in the reducer.  You may find these two
videos useful:

<http://www.cloudera.com/hadoop-training-mapreduce-hdfs>
<http://www.cloudera.com/hadoop-training-programming-with-hadoop>
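
To make the mapper/reducer split concrete, here is a minimal local sketch in
Python. It is not from the videos and does not use the Hadoop API. The `<doc>`
element, its `author` attribute, and the `run_local` helper are hypothetical
names chosen for illustration; `run_local` only imitates the grouping step that
the framework performs between map and reduce.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict


def mapper(xml_text):
    """Parse in the map step: emit (author, 1) for each <doc> element.

    The <doc author="..."> layout is an assumed example format.
    """
    root = ET.fromstring(xml_text)
    for doc in root.iter("doc"):
        yield doc.get("author", "unknown"), 1


def reducer(key, values):
    """Aggregate only: the reducer receives parsed values, never raw XML."""
    return key, sum(values)


def run_local(inputs):
    """Stand-in for the framework: group mapper output by key, then reduce."""
    grouped = defaultdict(list)
    for xml_text in inputs:
        for key, value in mapper(xml_text):
            grouped[key].append(value)
    return dict(reducer(key, values) for key, values in sorted(grouped.items()))
```

Calling `run_local` with a list of XML strings returns a dictionary of counts
per author; in a real job the grouping is done by Hadoop rather than by this
helper.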

Hope this helps!

Alex

On Sat, Jun 13, 2009 at 1:42 AM, Alexandre Jaquet <al...@gmail.com> wrote:

> Thanks Alex,
>
> Is parsing the documents a task done within the reducer? Do we collect the
> data (document input) within a mapper and then parse it?
>
> Thanks in advance
>
> Alexandre Jaquet

Re: parsing open xml

Posted by Alexandre Jaquet <al...@gmail.com>.

Thanks Alex,

Is parsing the documents a task done within the reducer? Do we collect the
data (document input) within a mapper and then parse it?

Thanks in advance

Alexandre Jaquet

2009/6/13 Alex Loddengaard <al...@cloudera.com>

> When you refer to "filesystem," do you mean HDFS?
>
> It's very common to store lots of text files in HDFS and run multiple jobs
> to process / learn about those text files.  As for XML support, you can use
> Java libraries (or Python libraries if you're using Hadoop streaming) to
> parse the XML; Hadoop itself doesn't have much XML support.  I hope this
> answers your question.
>
> Alex

Re: parsing open xml

Posted by Alex Loddengaard <al...@cloudera.com>.

When you refer to "filesystem," do you mean HDFS?

It's very common to store lots of text files in HDFS and run multiple jobs
to process / learn about those text files.  As for XML support, you can use
Java libraries (or Python libraries if you're using Hadoop streaming) to
parse the XML; Hadoop itself doesn't have much XML support.  I hope this
answers your question.
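
As a rough illustration of the streaming route, the sketch below shows what a
Python mapper could look like. It is an assumption-laden example, not a
reference implementation: it assumes each input line carries one complete XML
document (real files usually span many lines and would need a suitable input
format or preprocessing), and the names `map_record` and `run` are made up for
this example. Only the standard library's `xml.etree.ElementTree` is used.

```python
import sys
import xml.etree.ElementTree as ET


def map_record(line):
    """Parse one XML document held on a single line; yield (tag, 1) pairs."""
    try:
        root = ET.fromstring(line)
    except ET.ParseError:
        return  # skip malformed records instead of failing the whole task
    for element in root.iter():
        yield element.tag, 1


def run(stdin, stdout):
    """Streaming contract: read records from stdin, one per line, and
    write tab-separated key/value pairs to stdout."""
    for line in stdin:
        line = line.strip()
        if not line:
            continue
        for key, value in map_record(line):
            stdout.write(f"{key}\t{value}\n")


# To use this as a streaming mapper script, end the file with:
#     run(sys.stdin, sys.stdout)
```

The mapper here just counts element tags; any per-document extraction could
take its place inside `map_record`.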

Alex

On Fri, Jun 12, 2009 at 1:31 PM, Alexandre Jaquet <al...@gmail.com> wrote:

> Hi,
>
> Will Hadoop and MapReduce allow me to parse a large quantity of open XML
> files stored in the same filesystem, using multiple jobs?
>
> Thx
>
> Alexandre Jaquet
>