Posted to common-user@hadoop.apache.org by Khaled BEN BAHRI <Kh...@it-sudparis.eu> on 2010/07/13 10:17:07 UTC

Inputs of Mapreduce

Hello all,

I'm a novice with MapReduce, and I'm developing a MapReduce
function that takes XML documents as input.

How can I prepare the input files and specify them as input to the map function?

Thanks for your help.

Best regards
Khaled


Re: Inputs of Mapreduce

Posted by Paul Ingles <pa...@oobaloo.co.uk>.
We tried using the Hadoop Streaming XML format a while ago and it didn't quite go as expected. I don't remember why, but it gave some weird results: missing some records, getting to 98% complete and then stopping, etc.

The Mahout project also has an XmlInputFormat [1] that we ended up using. I also posted something on my blog about it all [2], and a little about my understanding (so far) of input formats and record readers etc.

Hope that helps,
Paul

1. http://github.com/apache/mahout/blob/ad84344e4055b1e6adff5779339a33fa29e1265d/examples/src/main/java/org/apache/mahout/classifier/bayes/XmlInputFormat.java
2. http://oobaloo.co.uk/articles/2010/1/20/processing-xml-in-hadoop.html

On 13 Jul 2010, at 12:26, Shuja Rehman wrote:

> Hi Khaled,
> XML files can be processed using Hadoop Streaming. Check out the following
> link.
> 
> http://hadoop.apache.org/common/docs/r0.15.2/streaming.html#How+do+I+parse+XML+documents+using+streaming%3F
> 
> Regards
> Shuja
> 


Re: Inputs of Mapreduce

Posted by Shuja Rehman <sh...@gmail.com>.
Hi Khaled,
XML files can be processed using Hadoop Streaming. Check out the following
link.

http://hadoop.apache.org/common/docs/r0.15.2/streaming.html#How+do+I+parse+XML+documents+using+streaming%3F
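
As a rough illustration of the streaming approach, here is a minimal Python mapper sketch. The linked FAQ describes selecting the XML record reader with an option of the form -inputreader "StreamXmlRecord,begin=BEGIN_STRING,end=END_STRING". The markers "<record>" and "</record>" and the child elements "id" and "title" below are hypothetical, chosen only for the example. Because the exact way records arrive on the mapper's standard input is an assumption here, the sketch scans stdin for the begin and end markers itself instead of relying on one record per line.

```python
import sys
import xml.etree.ElementTree as ET

# Hypothetical record markers; they play the role of begin=... and end=...
BEGIN = "<record>"
END = "</record>"


def iter_records(lines, begin=BEGIN, end=END):
    """Yield each begin...end span found in an iterable of text lines.

    Assumes the markers themselves never contain a line break.
    """
    buf = ""
    for line in lines:
        buf += line
        while True:
            i = buf.find(begin)
            if i < 0:
                buf = ""  # nothing buffered belongs to a record
                break
            j = buf.find(end, i)
            if j < 0:
                buf = buf[i:]  # record still open, wait for more lines
                break
            yield buf[i:j + len(end)]
            buf = buf[j + len(end):]


def map_record(record_xml):
    """Turn one XML record into a 'key<TAB>value' output line."""
    elem = ET.fromstring(record_xml)
    # "id" and "title" are hypothetical child elements.
    key = elem.findtext("id", default="").strip()
    # Collapse whitespace so the value cannot contain tabs or newlines.
    value = " ".join(elem.findtext("title", default="").split())
    return "%s\t%s" % (key, value)


def main(stdin=sys.stdin, stdout=sys.stdout):
    """Read records from stdin and write one mapper output line per record."""
    for record in iter_records(stdin):
        stdout.write(map_record(record) + "\n")
```

To run this as a streaming mapper, add a call to main() at the end of the script and pass the script to the streaming jar with the -mapper option.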

Regards
Shuja

On Tue, Jul 13, 2010 at 2:24 PM, edward choi <mp...@gmail.com> wrote:

> Khaled,
>
> By default, Hadoop MapReduce reads input files line by line.
> XML documents usually span multiple lines.
> So you will have to pack each XML document into a single line.
> Or you can write your own input format, for which you will need to consult
> a guidebook.
>



-- 
Regards
Shuja-ur-Rehman Baig
_________________________________
MS CS - School of Science and Engineering
Lahore University of Management Sciences (LUMS)
Sector U, DHA, Lahore, 54792, Pakistan
Cell: +92 3214207445

Re: Inputs of Mapreduce

Posted by edward choi <mp...@gmail.com>.
Khaled,

By default, Hadoop MapReduce reads input files line by line.
XML documents usually span multiple lines.
So you will have to pack each XML document into a single line.
Or you can write your own input format, for which you will need to consult
a guidebook.
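
The packing step could be done before uploading the files. Below is a minimal Python sketch, assuming each document fits in memory and that line breaks inside text content may be replaced by spaces. The function name pack_xml_to_line is made up for this example. Note that re-serializing through ElementTree drops the XML declaration and any comments outside the root element.

```python
import re
import xml.etree.ElementTree as ET


def pack_xml_to_line(xml_text):
    """Return the XML document as a single line of text."""
    # Parse first, so malformed input fails here and not later in the mapper.
    root = ET.fromstring(xml_text)
    # Serialize the tree back to a string.
    packed = ET.tostring(root, encoding="unicode")
    # Replace each run of whitespace containing a line break with one space.
    # Assumption: line breaks inside text content are not significant.
    return re.sub(r"\s*[\r\n]+\s*", " ", packed).strip()
```

Writing one packed document per line to the input file lets the default line-oriented input hand each document to the map function as a single record.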

2010/7/13 Khaled BEN BAHRI <Kh...@it-sudparis.eu>

> Hello all,
>
> I'm a novice with MapReduce, and I'm developing a MapReduce
> function that takes XML documents as input.
>
> How can I prepare the input files and specify them as input to the map function?
>
> Thanks for your help.
>
> Best regards
> Khaled
>
>