Posted to general@hadoop.apache.org by Peter Minearo <Pe...@Reardencommerce.com> on 2010/07/16 23:00:00 UTC

Hadoop and XML

I have an XML file that has sparse data in it.  I am running a MapReduce
Job that reads in an XML file, pulls out a Key from within the XML
snippet and then hands back the Key and the XML snippet (as the Value)
to the OutputCollector.  The reason is to sort the file back into order.
Below is the snippet of code. 
 
import java.io.IOException;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class XmlMapper extends MapReduceBase implements Mapper {

 private Text keyText = new Text();
 private Text valueText = new Text();

 @SuppressWarnings("unchecked")
 public void map(Object key, Object value, OutputCollector output,
     Reporter reporter) throws IOException {
  Text valueText = (Text)value; // note: shadows the valueText field above
  String valueString = new String(valueText.getBytes(), "UTF-8");
  String keyString = getXmlKey(valueString);
  getKeyText().set(keyString);
  getValueText().set(valueString);
  output.collect(getKeyText(), getValueText());
 }

 public Text getKeyText() {
  return keyText;
 }

 public void setKeyText(Text keyText) {
  this.keyText = keyText;
 }

 public Text getValueText() {
  return valueText;
 }

 public void setValueText(Text valueText) {
  this.valueText = valueText;
 }

 private String getXmlKey(String value) {
  // Get the Key from the XML in the value (body elided in the original post).
  return null;
 }
}
 
The XML snippet from the Value is fine when it is passed into the map()
method.  I am not changing any data either, just pulling out information
for the key.  The problem I am seeing is that between the Map phase and the
Reduce phase the XML gets munged.  For example:
 
 </PrivateRate>
  </PrivateRateSet>te>
 
It is my understanding that Hadoop uses the same instance of the Key and
Value object when calling the Map method.  What changes is the data
within those instances.  So, I ran an experiment where I do not have
different Key or Value Text Objects.  I reuse the ones passed into the
method, like below:
 
public class XmlMapper extends MapReduceBase implements Mapper {

 @SuppressWarnings("unchecked")
 public void map(Object key, Object value, OutputCollector output,
     Reporter reporter) throws IOException {
  Text keyText = (Text)key;
  Text valueText = (Text)value;
  String valueString = new String(valueText.getBytes(), "UTF-8");
  String keyString = getXmlKey(valueString);
  keyText.set(keyString);
  valueText.set(valueString);
  output.collect(keyText, valueText);
 }

 private String getXmlKey(String value) {
  // Get the Key from the XML in the value (body elided in the original post).
  return null;
 }
}
 
What was interesting is that with this version the XML was munged within
the Map phase itself.  When I changed over to the code at the top, the Map
phase was fine; however, the Reduce phase picked up the munged XML.  While
trying to debug the problem, I came across this method in the Text class:
 
public void set(byte[] utf8, int start, int len) {
    setCapacity(len, false);
    System.arraycopy(utf8, start, bytes, 0, len);
    this.length = len;
}
 
If the "bytes" array had a length of 1000 and the "utf8" array has a
length of 500; doing a System.arraycopy() would only copy the first 500
from "utf8" to "bytes" but leave the last 500 in "bytes" alone.  Could
this be the cause of the XML munging?
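
A minimal, standalone sketch of the behavior being suspected here (the class
name and the sample strings are illustrative only, not taken from the job
above):

import org.apache.hadoop.io.Text;

public class TextStaleBytesDemo {

 public static void main(String[] args) throws Exception {
  Text t = new Text();
  byte[] longer = "</PrivateRateSet>".getBytes("UTF-8"); // 17 bytes
  byte[] shorter = "</Rate>".getBytes("UTF-8");          // 7 bytes

  t.set(longer, 0, longer.length);   // backing array grows to 17 bytes
  t.set(shorter, 0, shorter.length); // backing array is NOT shrunk

  // Decodes the whole backing array, including 10 stale bytes;
  // prints "</Rate>teRateSet>" -- the same kind of munging shown above.
  System.out.println(new String(t.getBytes(), "UTF-8"));

  // Decodes only the valid region; prints "</Rate>".
  System.out.println(new String(t.getBytes(), 0, t.getLength(), "UTF-8"));
 }
}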
 
All of this leads me to a few questions:
 
1) Has anyone successfully used XML snippets as the data format within a
MapReduce job -- not just read from the input file, but carried through the
shuffle?
2) Is anyone seeing this problem with XML or any other format?
3) Does anyone know what is going on?
4) Is this a bug?
 

Thanks,
 
Peter 
 
 

Re: Hadoop and XML

Posted by Scott Carey <sc...@richrelevance.com>.
On Jul 20, 2010, at 11:24 AM, Scott Carey wrote:

> 
>> 
>> This sounds like a bug.
>> 
>> Let's say you create a Text object and drop in a String that sets the byte array length to 200.  Then drop in a second String that sets the byte array length to 500.  Since the new length is greater than the previous length, the byte array is reallocated at the longer length.  Now, if you drop in a third String that would set the byte array length to 350, the Text object does not replace the byte array with a new one of length 350; it keeps the larger 500-byte array and sets an extra variable to track the "real" length.
>> 
>> So: Text.getBytes().length != Text.getLength()
>> 
>> This does 2 things:
>> 
>> 1. Passes around more data than what is needed
>> 2. Makes the Text object confusing to work with
>> 
>> Text.getBytes().length == Text.getLength() should be the correct behavior.
>> 
>> 
> 
> I don't think so.  Passing around byte arrays larger than the valid data is common practice in Java for performance reasons; hence the common method signature (byte[] bytes, int offset, int len) and similar.  Creating a new byte array on each resize defeats the purpose of re-using the byte array and the Text object -- lower memory allocation and improved CPU cache locality.  The byte array here is a buffer; it does not represent the entire string.
> 

To be more specific here, shouldn't Text.toString() do the trick?  If Text.toString() does something other than what you expect, that should be documented, and the class should have another helper method that gets you a String from a Text.  Calling getBytes() and manually constructing a String means you need to know what those bytes represent: a buffer whose valid bytes run from index 0 to Text.getLength().
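
In other words, the one-line change in the mapper would presumably be (a
sketch of the suggestion, not a tested patch against the job above):

// Instead of decoding the whole backing buffer:
// String valueString = new String(valueText.getBytes(), "UTF-8");
// let Text do the bounded decode itself:
String valueString = valueText.toString();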

Re: Hadoop and XML

Posted by Scott Carey <sc...@richrelevance.com>.
> 
> This sounds like a bug.
> 
> Let's say you create a Text object and drop in a String that sets the byte array length to 200.  Then drop in a second String that sets the byte array length to 500.  Since the new length is greater than the previous length, the byte array is reallocated at the longer length.  Now, if you drop in a third String that would set the byte array length to 350, the Text object does not replace the byte array with a new one of length 350; it keeps the larger 500-byte array and sets an extra variable to track the "real" length.
> 
> So: Text.getBytes().length != Text.getLength()
> 
> This does 2 things:
> 
> 1. Passes around more data than what is needed
> 2. Makes the Text object confusing to work with
> 
> Text.getBytes().length == Text.getLength() should be the correct behavior.
> 
> 

I don't think so.  Passing around byte arrays larger than the valid data is common practice in Java for performance reasons; hence the common method signature (byte[] bytes, int offset, int len) and similar.  Creating a new byte array on each resize defeats the purpose of re-using the byte array and the Text object -- lower memory allocation and improved CPU cache locality.  The byte array here is a buffer; it does not represent the entire string.
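
As an illustration of that buffer idiom with an API that is known to follow
it (the class name TextWriteDemo is made up for this sketch):

import java.io.ByteArrayOutputStream;

import org.apache.hadoop.io.Text;

public class TextWriteDemo {

 public static void main(String[] args) {
  Text t = new Text("example");
  ByteArrayOutputStream out = new ByteArrayOutputStream();
  // Pass the buffer plus the valid range; never assume that
  // t.getBytes().length == t.getLength().
  out.write(t.getBytes(), 0, t.getLength());
  System.out.println(out.size()); // 7 -- the valid length, not the buffer length
 }
}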


RE: Hadoop and XML

Posted by Peter Minearo <Pe...@Reardencommerce.com>.
That is exactly what is happening.  This is the code from the Text class.

public void set(String string) {
    try {
      ByteBuffer bb = encode(string, true);
      bytes = bb.array();
      length = bb.limit();
    }catch(CharacterCodingException e) {
      throw new RuntimeException("Should not have happened " + e.toString()); 
    }
  }


This sounds like a bug.  

Let's say you create a Text object and drop in a String that sets the byte array length to 200.  Then drop in a second String that sets the byte array length to 500.  Since the new length is greater than the previous length, the byte array is reallocated at the longer length.  Now, if you drop in a third String that would set the byte array length to 350, the Text object does not replace the byte array with a new one of length 350; it keeps the larger 500-byte array and sets an extra variable to track the "real" length.

So: Text.getBytes().length != Text.getLength()

This does 2 things:

1. Passes around more data than what is needed
2. Makes the Text object confusing to work with

Text.getBytes().length == Text.getLength() should be the correct behavior.
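
The sequence described can be reproduced directly with the
set(byte[], int, int) overload (a sketch; set(String) goes through a charset
encoder, so its buffer sizes are less predictable):

import org.apache.hadoop.io.Text;

public class TextCapacityDemo {

 public static void main(String[] args) {
  Text t = new Text();
  t.set(new byte[200], 0, 200);
  t.set(new byte[500], 0, 500); // buffer reallocated to 500 bytes
  t.set(new byte[350], 0, 350); // buffer stays at 500; only the length field shrinks
  System.out.println(t.getBytes().length); // 500, with the Text code quoted above
  System.out.println(t.getLength());       // 350
 }
}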



-----Original Message-----
From: Jeff Bean [mailto:jwfbean@cloudera.com]
Sent: Tue 7/20/2010 9:23 AM
To: general@hadoop.apache.org
Subject: Re: Hadoop and XML
 
data.length is the length of the byte array.

Text.getLength() most likely returns a different value than getBytes().length.

Hadoop reuses box class objects like Text, so what it's probably doing is
writing over the byte array, lengthening it as necessary, and just updating
a separate length attribute.

Jeff

On Tue, Jul 20, 2010 at 8:56 AM, Ted Yu <yu...@gmail.com> wrote:

> Interesting.
> String class is able to handle this scenario:
>
>  348       public String(byte[] data, String encoding) throws
> UnsupportedEncodingException {
>  349           this(data, 0, data.length, encoding);
>  350       }
>
>
> On Tue, Jul 20, 2010 at 6:01 AM, Jeff Bean <jw...@cloudera.com> wrote:
>
> > I think the problem is here:
> >
> > String valueString = new String(valueText.getBytes(), "UTF-8");
> >
> > Javadoc for Text says:
> >
> > getBytes() -- Returns the raw bytes; however, only data up to getLength()
> > is valid.
> > (getBytes: http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/io/Text.html#getBytes%28%29,
> > getLength: http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/io/Text.html#getLength%28%29)
> >
> > So try getting the length, truncating the byte array at the value returned
> > by getLength() and THEN converting it to a String.
> >
> > Jeff
> >
> > On Mon, Jul 19, 2010 at 9:08 AM, Ted Yu <yu...@gmail.com> wrote:
> >
> > > For your initial question on Text.set():
> > > Text.setCapacity() allocates a new byte array. Since keepData is false,
> > > old data wouldn't be copied over.
> > >
> > > On Mon, Jul 19, 2010 at 8:01 AM, Peter Minearo <
> > > Peter.Minearo@reardencommerce.com> wrote:
> > >
> > > > I am already using XmlInputFormat.  The input into the Map phase is
> > > > not the problem.  The problem lies between the Map and Reduce phases.
> > > >
> > > > BTW - The article is correct.  DO NOT USE StreamXmlRecordReader.
> > > > XmlInputFormat is a lot faster.  From my testing, StreamXmlRecordReader
> > > > took 8 minutes to read a 1 GB XML document, whereas XmlInputFormat was
> > > > under 2 minutes. (Using 2-core, 8GB machines)
> > > >
> > > > -----Original Message-----
> > > > From: Ted Yu [mailto:yuzhihong@gmail.com]
> > > > Sent: Friday, July 16, 2010 9:44 PM
> > > > To: general@hadoop.apache.org
> > > > Subject: Re: Hadoop and XML
> > > >
> > > > From an earlier post:
> > > > http://oobaloo.co.uk/articles/2010/1/20/processing-xml-in-hadoop.html
> > > >
> > > > On Fri, Jul 16, 2010 at 3:07 PM, Peter Minearo <
> > > > Peter.Minearo@reardencommerce.com> wrote:
> > > > >
> > > > > Moving the variable to a local variable did not seem to work:
> > > > >
> > > > > </PrivateRateSet>vateRateSet>
> > > > >
> > > > > public void map(Object key, Object value, OutputCollector output,
> > > > > Reporter reporter) throws IOException {
> > > > >         Text valueText = (Text)value;
> > > > >         String valueString = new String(valueText.getBytes(), "UTF-8");
> > > > >         String keyString = getXmlKey(valueString);
> > > > >         Text returnKeyText = new Text();
> > > > >         Text returnValueText = new Text();
> > > > >         returnKeyText.set(keyString);
> > > > >         returnValueText.set(valueString);
> > > > >         output.collect(returnKeyText, returnValueText);
> > > > > }
> > > > >
> > > > > -----Original Message-----
> > > > > From: Peter Minearo [mailto:Peter.Minearo@Reardencommerce.com]
> > > > > Sent: Fri 7/16/2010 2:51 PM
> > > > > To: general@hadoop.apache.org
> > > > > Subject: RE: Hadoop and XML
> > > > >
> > > > > Whoops....right after I sent it, someone else made a suggestion and I
> > > > > realized what question 2 was about.  I can try that, but wouldn't that
> > > > > cause Object bloat?  During the Hadoop training I went through, it was
> > > > > mentioned to reuse the returned Key and Value objects to keep the
> > > > > number of Objects created down to a minimum.  Is this not really a
> > > > > valid point?
> > > > >
> > > > > -----Original Message-----
> > > > > From: Peter Minearo [mailto:Peter.Minearo@Reardencommerce.com]
> > > > > Sent: Friday, July 16, 2010 2:44 PM
> > > > > To: general@hadoop.apache.org
> > > > > Subject: RE: Hadoop and XML
> > > > >
> > > > > I am not using multi-threaded Map tasks.  Also, if I understand your
> > > > > second question correctly:
> > > > > "Also can you try creating the output key and values in the map
> > > > > method (method local)?"
> > > > > In the first code snippet I am doing exactly that.
> > > > >
> > > > > Below is the class that runs the Job.
> > > > >
> > > > > public class HadoopJobClient {
> > > > >
> > > > >        private static final Log LOGGER =
> > > > > LogFactory.getLog(Prds.class.getName());
> > > > >
> > > > >        public static void main(String[] args) {
> > > > >                JobConf conf = new JobConf(Prds.class);
> > > > >
> > > > >                conf.set("xmlinput.start", "<PrivateRateSet>");
> > > > >                conf.set("xmlinput.end", "</PrivateRateSet>");
> > > > >
> > > > >                conf.setJobName("PRDS Parse");
> > > > >
> > > > >                conf.setOutputKeyClass(Text.class);
> > > > >                conf.setOutputValueClass(Text.class);
> > > > >
> > > > >                conf.setMapperClass(PrdsMapper.class);
> > > > >                conf.setReducerClass(PrdsReducer.class);
> > > > >
> > > > >                conf.setInputFormat(XmlInputFormat.class);
> > > > >                conf.setOutputFormat(TextOutputFormat.class);
> > > > >
> > > > >                FileInputFormat.setInputPaths(conf, new Path(args[0]));
> > > > >                FileOutputFormat.setOutputPath(conf, new Path(args[1]));
> > > > >
> > > > >                // Run the job
> > > > >                try {
> > > > >                        JobClient.runJob(conf);
> > > > >                } catch (IOException e) {
> > > > >                        LOGGER.error(e.getMessage(), e);
> > > > >                }
> > > > >        }
> > > > > }
> > > > >
> > > > > -----Original Message-----
> > > > > From: Soumya Banerjee [mailto:soumya.sbanerjee@gmail.com]
> > > > > Sent: Fri 7/16/2010 2:29 PM
> > > > > To: general@hadoop.apache.org
> > > > > Subject: Re: Hadoop and XML
> > > > >
> > > > > Hi,
> > > > >
> > > > > Can you please share the code of the job submission client?
> > > > >
> > > > > Also can you try creating the output key and values in the map
> > > > > method (method local)?
> > > > > Make sure you are not using a multi-threaded map task configuration.
> > > > >
> > > > > map()
> > > > > {
> > > > > private Text keyText = new Text();
> > > > > private Text valueText = new Text();
> > > > >
> > > > > //rest of the code
> > > > > }
> > > > >
> > > > > Soumya.


Re: Hadoop and XML

Posted by Ted Yu <yu...@gmail.com>.
I also added Peter's comment to the JIRA I logged:
https://issues.apache.org/jira/browse/HADOOP-6868


Re: Hadoop and XML

Posted by Ted Yu <yu...@gmail.com>.
So the correct call should be:
String valueString = new String(valueText.getBytes(), 0,
valueText.getLength(), "UTF-8");

Cheers
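
Dropped into the original mapper, that is (sketch only; the rest of the
class is unchanged):

Text valueText = (Text)value;
// Decode only the valid region of the reused buffer:
String valueString = new String(valueText.getBytes(), 0,
    valueText.getLength(), "UTF-8");
// Equivalent, and simpler: String valueString = valueText.toString();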

On Tue, Jul 20, 2010 at 9:23 AM, Jeff Bean <jw...@cloudera.com> wrote:

> data.length is the length of the byte array.
>
> Text.getLength() most likely returns a different value than
> getBytes().length.
>
> Hadoop reuses box class objects like Text, so what it's probably doing is
> writing over the byte array, lengthening it as necessary, and just updating
> a separate length attribute.
>
> Jeff

Re: Hadoop and XML

Posted by Jeff Bean <jw...@cloudera.com>.
data.length is the length of the byte array.

Text.getLength() most likely returns a different value than getBytes().length.

Hadoop reuses box class objects like Text, so what it's probably doing is
writing over the byte array, lengthening it as necessary, and just updating
a separate length attribute.

Jeff

On Tue, Jul 20, 2010 at 8:56 AM, Ted Yu <yu...@gmail.com> wrote:

> Interesting.
> String class is able to handle this scenario:
>
>  348       public String(byte[] data, String encoding) throws
> UnsupportedEncodingException {
>  349           this(data, 0, data.length, encoding);
>  350       }

Re: Hadoop and XML

Posted by Ted Yu <yu...@gmail.com>.
Interesting.
The String class is able to handle this scenario:

public String(byte[] data, String encoding) throws UnsupportedEncodingException {
    this(data, 0, data.length, encoding);
}
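
Since the two-argument constructor delegates with data.length, a reused Text
whose backing array is longer than getLength() will leak its stale tail bytes
into the resulting String. A small self-contained sketch of that effect
(hypothetical values; the exact behavior depends on this era's Text
implementation, so verify against your version):

import org.apache.hadoop.io.Text;

public class StaleBytesDemo {
    public static void main(String[] args) throws Exception {
        byte[] longer  = "</PrivateRateSet>".getBytes("UTF-8");  // 17 bytes
        byte[] shorter = "</PrivateRate>".getBytes("UTF-8");     // 14 bytes
        Text t = new Text();
        t.set(longer, 0, longer.length);    // backing array grows to 17 bytes
        t.set(shorter, 0, shorter.length);  // buffer reused; only 14 bytes overwritten
        // The two-argument constructor reads the whole backing array,
        // stale tail included:
        System.out.println(new String(t.getBytes(), "UTF-8"));  // "</PrivateRate>et>"
    }
}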




Re: Hadoop and XML

Posted by Jeff Bean <jw...@cloudera.com>.
I think the problem is here:

String valueString = new String(valueText.getBytes(), "UTF-8");

Javadoc for Text says:

getBytes() <http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/io/Text.html#getBytes%28%29>
        Returns the raw bytes; however, only data up to getLength()
        <http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/io/Text.html#getLength%28%29> is valid.

So try getting the length, truncating the byte array at the value returned
by getLength() and THEN converting it to a String.
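
For example, a minimal sketch of that fix inside the map() method from the
earlier snippets (Arrays.copyOf does the truncation; the commented one-liner
is an equivalent that skips the extra copy):

    Text valueText = (Text) value;
    // Truncate the backing array at getLength() before converting:
    byte[] exact = java.util.Arrays.copyOf(valueText.getBytes(), valueText.getLength());
    String valueString = new String(exact, "UTF-8");
    // Equivalent without the intermediate copy:
    // String valueString = new String(valueText.getBytes(), 0, valueText.getLength(), "UTF-8");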

Jeff


Re: Hadoop and XML

Posted by Ted Yu <yu...@gmail.com>.
Regarding your initial question on Text.set(): Text.setCapacity() allocates a
new byte array, and since keepData is false, the old data wouldn't be copied
over.
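
For reference, a sketch of setCapacity() as it appears in this era's Hadoop
source (paraphrased from memory, so verify against your version):

private void setCapacity(int len, boolean keepData) {
    if (bytes == null || bytes.length < len) {
        byte[] newBytes = new byte[len];
        if (bytes != null && keepData) {
            System.arraycopy(bytes, 0, newBytes, 0, length);
        }
        bytes = newBytes;
    }
}

Note that it only allocates when the existing buffer is too small: setting a
shorter value reuses the buffer, so bytes past getLength() keep their old
contents.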


RE: Hadoop and XML

Posted by Peter Minearo <Pe...@Reardencommerce.com>.
I am already using XmlInputFormat.  The input into the Map phase is not
the problem.  The problem lies between the Map and Reduce phases.

BTW - The article is correct.  DO NOT USE StreamXmlRecordReader.
XmlInputFormat is a lot faster.  From my testing, StreamXmlRecordReader
took 8 minutes to read a 1 GB XML document, whereas XmlInputFormat was
under 2 minutes (using 2-core, 8 GB machines).
 


Re: Hadoop and XML

Posted by Ted Yu <yu...@gmail.com>.
From an earlier post:
http://oobaloo.co.uk/articles/2010/1/20/processing-xml-in-hadoop.html


RE: Hadoop and XML

Posted by Peter Minearo <Pe...@Reardencommerce.com>.
Moving the variables to local variables did not seem to work:


</PrivateRateSet>vateRateSet>



public void map(Object key, Object value, OutputCollector output, Reporter reporter) throws IOException {
		Text valueText = (Text)value;
		String valueString = new String(valueText.getBytes(), "UTF-8");
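		// (Still converts the full backing array rather than just
		//  getLength() bytes; this unbounded conversion is the issue
		//  discussed in the replies above.)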
		String keyString = getXmlKey(valueString);
		Text returnKeyText = new Text();
		Text returnValueText = new Text();
		returnKeyText.set(keyString);
		returnValueText.set(valueString);
		output.collect(returnKeyText, returnValueText);
}


RE: Hadoop and XML

Posted by Peter Minearo <Pe...@Reardencommerce.com>.
Whoops....right after I sent it, someone else made a suggestion and I
realized what question 2 was about.  I can try that, but wouldn't that
cause Object bloat?  During the Hadoop training I went through, it was
mentioned to reuse the returning Key and Value objects to keep the
number of Objects created to a minimum.  Is that not really a valid
point?

 

-----Original Message-----
From: Peter Minearo [mailto:Peter.Minearo@Reardencommerce.com] 
Sent: Friday, July 16, 2010 2:44 PM
To: general@hadoop.apache.org
Subject: RE: Hadoop and XML


I am not using multi-threaded Map tasks.  Also, if I understand your
second question correctly:
"Also, can you try creating the output key and values in the map
method (method-local)?"
In the first code snippet I am doing exactly that.

Below is the class that runs the Job.

import java.io.IOException;

import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.TextOutputFormat;

public class HadoopJobClient {

	private static final Log LOGGER = LogFactory.getLog(Prds.class.getName());

	public static void main(String[] args) {
		JobConf conf = new JobConf(Prds.class);

		// Tell XmlInputFormat which tags delimit a record
		conf.set("xmlinput.start", "<PrivateRateSet>");
		conf.set("xmlinput.end", "</PrivateRateSet>");

		conf.setJobName("PRDS Parse");

		conf.setOutputKeyClass(Text.class);
		conf.setOutputValueClass(Text.class);

		conf.setMapperClass(PrdsMapper.class);
		conf.setReducerClass(PrdsReducer.class);

		conf.setInputFormat(XmlInputFormat.class);
		conf.setOutputFormat(TextOutputFormat.class);

		FileInputFormat.setInputPaths(conf, new Path(args[0]));
		FileOutputFormat.setOutputPath(conf, new Path(args[1]));

		// Run the job
		try {
			JobClient.runJob(conf);
		} catch (IOException e) {
			LOGGER.error(e.getMessage(), e);
		}

	}

}




-----Original Message-----
From: Soumya Banerjee [mailto:soumya.sbanerjee@gmail.com]
Sent: Fri 7/16/2010 2:29 PM
To: general@hadoop.apache.org
Subject: Re: Hadoop and XML
 
Hi,

Can you please share the code of the job submission client?

Also, can you try creating the output key and values in the map
method (method-local)?
Make sure you are not using a multi-threaded map task configuration.

map()
{
    Text keyText = new Text();
    Text valueText = new Text();

    // rest of the code
}

Soumya.

On Sat, Jul 17, 2010 at 2:30 AM, Peter Minearo <
Peter.Minearo@reardencommerce.com> wrote:

> I have an XML file that has sparse data in it.  I am running a 
> MapReduce Job that reads in an XML file, pulls out a Key from within 
> the XML snippet and then hands back the Key and the XML snippet (as 
> the Value) to the OutputCollector.  The reason is to sort the file
back into order.
> Below is the snippet of code.
>
> public class XmlMapper extends MapReduceBase implements Mapper {
>
>  private Text keyText = new Text();
>  private Text valueText = new Text();
>
>  @SuppressWarnings("unchecked")
>  public void map(Object key, Object value, OutputCollector output, 
> Reporter reporter) throws IOException {  Text valueText = (Text)value;

> String valueString = new String(valueText.getBytes(), "UTF-8");  
> String keyString = getXmlKey(valueString);  
> getKeyText().set(keyString);  getValueText().set(valueString);  
> output.collect(getKeyText(), getValueText());  }
>
>
>  public Text getKeyText() {
>  return keyText;
>  }
>
>
>  public void setKeyText(Text keyText) {  this.keyText = keyText;  }
>
>
>  public Text getValueText() {
>  return valueText;
>  }
>
>
>  public void setValueText(Text valueText) {  this.valueText = 
> valueText;  }
>
>
>  private String getXmlKey(String value) {
>        // Get the Key from the XML in the value.
>  }
>
> }
>
> The XML snippet from the Value is fine when it is passed into the 
> map() method.  I am not changing any data either, just pulling out 
> information for the key.  The problem I am seeing is between the Map 
> phase and the Reduce phase, the XML is getting munged.  For Example:
>
>  </PrivateRate>
>  </PrivateRateSet>te>
>
> It is my understanding that Hadoop uses the same instance of the Key 
> and Value object when calling the Map method.  What changes is the 
> data within those instances.  So, I ran an experiment where I do not 
> have different Key or Value Text Objects.  I reuse the ones passed 
> into the method, like below:
>
> public class XmlMapper extends MapReduceBase implements Mapper {
>
>  @SuppressWarnings("unchecked")
>  public void map(Object key, Object value, OutputCollector output, 
> Reporter reporter) throws IOException {  Text keyText = (Text)key;  
> Text valueText = (Text)value;  String valueString = new 
> String(valueText.getBytes(), "UTF-8");  String keyString = 
> getXmlKey(valueString);  keyText.set(keyString);  
> valueText.set(valueString);  output.collect(keyText, valueText);  }
>
>
>  private String getXmlKey(String value) {
>        // Get the Key from the XML in the value.
>  }
>
> }
>
> What was interesting about this is the fact that the XML was getting 
> munged within the Map Phase.  When I changed over to the code at the 
> top, the Map phase was fine.  However, the Reduce phase picks up the 
> munged XML.  Trying to debug the problem, I came across this method in

> the Text Object:
>
> public void set(byte[] utf8, int start, int len) {
>    setCapacity(len, false);
>    System.arraycopy(utf8, start, bytes, 0, len);
>    this.length = len;
> }
>
> If the "bytes" array had a length of 1000 and the "utf8" array has a 
> length of 500; doing a System.arraycopy() would only copy the first 
> 500 from "utf8" to "bytes" but leave the last 500 in "bytes" alone.  
> Could this be the cause of the XML munging?
>
> All of this leads me to a few questions:
>
> 1) Has anyone successfully used XML snippets as the data format within

> a MapReduce job; not just reading from the file but used during the 
> shuffle?
> 2) Is anyone seeing this problem with XML or any other format?
> 3) Does anyone know what is going on?
> 4) Is this a bug?
>
>
> Thanks,
>
> Peter
>
>
>


RE: Hadoop and XML

Posted by Peter Minearo <Pe...@Reardencommerce.com>.
I am not using multi-threaded Map tasks.  Also, if I understand your second question correctly:
"Also, can you try creating the output key and value objects in the map method (method-local)?"
In the first code snippet I am doing exactly that.

Below is the class that runs the Job.

public class HadoopJobClient {

	private static final Log LOGGER = LogFactory.getLog(Prds.class.getName());
	
	public static void main(String[] args) {
		JobConf conf = new JobConf(Prds.class);
		
		conf.set("xmlinput.start", "<PrivateRateSet>");
		conf.set("xmlinput.end", "</PrivateRateSet>");
		
		conf.setJobName("PRDS Parse");

		conf.setOutputKeyClass(Text.class);
		conf.setOutputValueClass(Text.class);

		conf.setMapperClass(PrdsMapper.class);
		conf.setReducerClass(PrdsReducer.class);

		conf.setInputFormat(XmlInputFormat.class);
		conf.setOutputFormat(TextOutputFormat.class);

		FileInputFormat.setInputPaths(conf, new Path(args[0]));
		FileOutputFormat.setOutputPath(conf, new Path(args[1]));

		// Run the job
		try {
			JobClient.runJob(conf);
		} catch (IOException e) {
			LOGGER.error(e.getMessage(), e);
		}

	}
	
	
}
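
As a side note on the multi-threading point: with this (old mapred) API,
map() only runs on multiple threads if the driver explicitly installs a
multi-threaded runner, which the class above does not.  For illustration
only -- this is the opt-in, not something this job sets:

// Opt-in to multi-threaded maps; without this line the default MapRunner
// calls map() from a single thread, so shared Text fields are safe there.
conf.setMapRunnerClass(org.apache.hadoop.mapred.lib.MultithreadedMapRunner.class);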






Re: Hadoop and XML

Posted by Soumya Banerjee <so...@gmail.com>.
Hi,

Can you please share the code of the job submission client?

Also, can you try creating the output key and value objects in the map
method (method-local), as in the sketch below?  And make sure you are
not using a multi-threaded map task configuration.

map()
{
 Text keyText = new Text();    // method-local, not instance fields
 Text valueText = new Text();

 // rest of the code
}
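
Expanded against the map() signature from the first message, that
suggestion would look roughly like this -- fresh objects on every call
rule out any cross-call state while debugging, at the cost of extra
garbage.  The length-aware decode is an addition of mine, since
Text.getBytes() returns the full backing array and only the first
getLength() bytes are valid:

public void map(Object key, Object value, OutputCollector output,
    Reporter reporter) throws IOException {
 Text inputValue = (Text) value;
 Text keyText = new Text();    // fresh per call (method-local)
 Text valueText = new Text();
 String valueString =
     new String(inputValue.getBytes(), 0, inputValue.getLength(), "UTF-8");
 keyText.set(getXmlKey(valueString));
 valueText.set(valueString);
 output.collect(keyText, valueText);
}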

Soumya.
