Posted to common-user@hadoop.apache.org by bhushan_mahale <bh...@persistent.co.in> on 2009/10/29 13:18:34 UTC

Large Text object to String conversion

Hi,

I am writing MapReduce code using the MapRunnable interface.
The input format is SequenceFileInputFormat.

Each sequence-file record contains a key-value pair of type <Text key, Text value> (Text: org.apache.hadoop.io.Text).

The "key" Text object contains small string where as "value" Text object contains large XML string.
"value" Text object can contain the data as large as 100 to 300 MB.

I convert the "value" Text object to String using value.toString() method.
It goes OutOfMemory for large data in "value" object.
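
Roughly, my MapRunnable looks like this (a simplified sketch; the class name and
the parsing step are placeholders):

    import java.io.IOException;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.MapRunnable;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.RecordReader;
    import org.apache.hadoop.mapred.Reporter;

    public class XmlMapRunner implements MapRunnable<Text, Text, Text, Text> {
      public void configure(JobConf job) { }

      public void run(RecordReader<Text, Text> input, OutputCollector<Text, Text> output,
                      Reporter reporter) throws IOException {
        Text key = input.createKey();
        Text value = input.createValue();
        while (input.next(key, value)) {
          // This is the line that fails: it materializes the whole
          // 100-300 MB value as a single java.lang.String.
          String xml = value.toString();
          // ... parse xml and collect output ...
        }
      }
    }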

Is there any other way to convert a large Text object to a Java String?
Alternatively, can I limit the number of records the RecordReader passes to the run method, so that total memory utilization stays bounded?

Thanks,
- Bhushan


DISCLAIMER
==========
This e-mail may contain privileged and confidential information which is the property of Persistent Systems Ltd. It is intended only for the use of the individual or entity to which it is addressed. If you are not the intended recipient, you are not authorized to read, retain, copy, print, distribute or use this message. If you have received this communication in error, please notify the sender and delete all copies of this message. Persistent Systems Ltd. does not accept any liability for virus infected mails.

Re: Large Text object to String conversion

Posted by Mark Kerzner <ma...@gmail.com>.
Bhushan,

have you considered simply raising the memory limit for Hadoop? 100-300 MB is
not that much, and 2 GB is a very moderate memory requirement for today's
machines. For comparison, a small EC2 instance has 1.7 GB.

On Tue, Dec 22, 2009 at 9:10 AM, Jason Venner <ja...@gmail.com> wrote:

> The Text class supports low-level access to the underlying byte array in
> the Text object.
>
> You can call getBytes() directly and then incrementally transcode the bytes
> into characters using the charset decoder tools,
> or call the charAt method to get the characters one by one.
> The bytesToCodePoint method provides a simpler interface for sequentially
> working through the data.
>
> On Thu, Oct 29, 2009 at 4:18 AM, bhushan_mahale <
> bhushan_mahale@persistent.co.in> wrote:
>
> > Hi,
> >
> > I am writing MapReduce code using the MapRunnable interface.
> > The input format is SequenceFileInputFormat.
> >
> > Each sequence-file record contains a key-value pair of type <Text key, Text
> > value> (Text: org.apache.hadoop.io.Text).
> >
> > The "key" Text object contains a small string, whereas the "value" Text
> > object contains a large XML string.
> > The "value" Text object can contain data as large as 100 to 300 MB.
> >
> > I convert the "value" Text object to a String using the value.toString()
> > method.
> > It throws an OutOfMemoryError for large data in the "value" object.
> >
> > Is there any other way to convert a large Text object to a Java String?
> > Alternatively, can I limit the number of records the RecordReader passes to
> > the run method, so that total memory utilization stays bounded?
> >
> > Thanks,
> > - Bhushan
> >
>
>
>
> --
> Pro Hadoop, a book to guide you from beginner to hadoop mastery,
> http://www.amazon.com/dp/1430219424?tag=jewlerymall
> www.prohadoopbook.com a community for Hadoop Professionals
>

Re: Large Text object to String conversion

Posted by Jason Venner <ja...@gmail.com>.
The Text class supports low-level access to the underlying byte array in the
Text object.

You can call getBytes() directly and then incrementally transcode the bytes
into characters using the charset decoder tools,
or call the charAt method to get the characters one by one.
The bytesToCodePoint method provides a simpler interface for sequentially
working through the data.
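
A rough sketch of that last approach (the class name is just for illustration;
it only counts code points, but the same loop works for any streaming parse of
the value):

    import java.nio.ByteBuffer;
    import org.apache.hadoop.io.Text;

    public class TextCodePointScanner {
      /** Walks a large Text value one Unicode code point at a time,
          without ever building the whole thing as a java.lang.String. */
      public static long countCodePoints(Text value) {
        // getBytes() returns the backing array, which may be longer than the
        // valid data, so limit the buffer to getLength().
        ByteBuffer buf = ByteBuffer.wrap(value.getBytes(), 0, value.getLength());
        long count = 0;
        while (buf.hasRemaining()) {
          Text.bytesToCodePoint(buf); // decodes the next UTF-8 code point and advances the buffer
          count++;
        }
        return count;
      }
    }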

On Thu, Oct 29, 2009 at 4:18 AM, bhushan_mahale <
bhushan_mahale@persistent.co.in> wrote:

> Hi,
>
> I am writing MapReduce code using the MapRunnable interface.
> The input format is SequenceFileInputFormat.
>
> Each sequence-file record contains a key-value pair of type <Text key, Text
> value> (Text: org.apache.hadoop.io.Text).
>
> The "key" Text object contains a small string, whereas the "value" Text object
> contains a large XML string.
> The "value" Text object can contain data as large as 100 to 300 MB.
>
> I convert the "value" Text object to a String using the value.toString() method.
> It throws an OutOfMemoryError for large data in the "value" object.
>
> Is there any other way to convert a large Text object to a Java String?
> Alternatively, can I limit the number of records the RecordReader passes to
> the run method, so that total memory utilization stays bounded?
>
> Thanks,
> - Bhushan
>
>



-- 
Pro Hadoop, a book to guide you from beginner to hadoop mastery,
http://www.amazon.com/dp/1430219424?tag=jewlerymall
www.prohadoopbook.com a community for Hadoop Professionals

Re: What if an XML file is across the boundary of HDFS chunks?

Posted by Brian Bockelman <bb...@cse.unl.edu>.
Hey Steve,

I think I've run across code in SVN that is a splitter for XML entries
like this. Look at StreamXmlRecordReader; I think it does what you want.
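
If I remember the property names right, wiring it up from the old mapred API
looks roughly like this (the <record> tags are placeholders for whatever element
delimits one logical record, and the streaming contrib jar has to be on the
job classpath):

    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.streaming.StreamInputFormat;

    public class XmlJobSetup {
      public static JobConf configure() {
        JobConf conf = new JobConf();
        conf.setInputFormat(StreamInputFormat.class);
        // Everything between the begin and end strings becomes one record,
        // even when it spans an HDFS block boundary.
        conf.set("stream.recordreader.class",
                 "org.apache.hadoop.streaming.StreamXmlRecordReader");
        conf.set("stream.recordreader.begin", "<record>");
        conf.set("stream.recordreader.end", "</record>");
        return conf;
      }
    }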

Brian

On Oct 29, 2009, at 4:12 PM, Amandeep Khurana wrote:

> Store the entire xml in one line...
>
> On 10/29/09, Steve Gao <st...@yahoo.com> wrote:
>> Does anybody have a similar issue? If you store XML files in HDFS, how can
>> you make sure that a chunk read by a mapper does not contain partial data
>> of an XML segment?
>>
>> For example:
>>
>> <title>
>> <book>book1</book>
>> <author>me</author>
>> ..............what if this is the boundary of a chunk?...................
>> <year>2009</year>
>> <book>book2</book>
>>
>> <author>me</author>
>>
>> <year>2009</year>
>> <book>book3</book>
>>
>> <author>me</author>
>>
>> <year>2009</year>
>> <title>
>>
>>
>>
>>
>
>
> -- 
>
>
> Amandeep Khurana
> Computer Science Graduate Student
> University of California, Santa Cruz


Re: What if an XML file is across the boundary of HDFS chunks?

Posted by Brian Bockelman <bb...@cse.unl.edu>.
Hey Steve,

Look at the mailing list archives - there's a specialized input splitter, suggested by at least two different people, that you could use.

Brian

On Nov 16, 2009, at 2:02 PM, Steve Gao wrote:

> Thanks. But this is not a neat solution in case the XML block is very large.
> Does anybody have another solution? Thanks!
> 
> --- On Thu, 10/29/09, Amandeep Khurana <am...@gmail.com> wrote:
> 
> From: Amandeep Khurana <am...@gmail.com>
> Subject: Re: What if an XML file is across the boundary of HDFS chunks?
> To: common-user@hadoop.apache.org
> Date: Thursday, October 29, 2009, 5:12 PM
> 
> Store the entire xml in one line...
> 
> On 10/29/09, Steve Gao <st...@yahoo.com> wrote:
>> Does anybody have a similar issue? If you store XML files in HDFS, how can
>> you make sure that a chunk read by a mapper does not contain partial data of
>> an XML segment?
>> 
>> For example:
>> 
>> <title>
>> <book>book1</book>
>> <author>me</author>
>> ..............what if this is the boundary of a chunk?...................
>> <year>2009</year>
>> <book>book2</book>
>> 
>> <author>me</author>
>> 
>> <year>2009</year>
>> <book>book3</book>
>> 
>> <author>me</author>
>> 
>> <year>2009</year>
>> <title>
>> 
>> 
>> 
>> 
> 
> 
> -- 
> 
> 
> Amandeep Khurana
> Computer Science Graduate Student
> University of California, Santa Cruz
> 
> 
> 


Re: What if an XML file is across the boundary of HDFS chunks?

Posted by Steve Gao <st...@yahoo.com>.
Thanks. But this is not a neat solution in case the XML block is very large.
Does anybody have another solution? Thanks!

--- On Thu, 10/29/09, Amandeep Khurana <am...@gmail.com> wrote:

From: Amandeep Khurana <am...@gmail.com>
Subject: Re: What if an XML file is across the boundary of HDFS chunks?
To: common-user@hadoop.apache.org
Date: Thursday, October 29, 2009, 5:12 PM

Store the entire xml in one line...

On 10/29/09, Steve Gao <st...@yahoo.com> wrote:
> Does anybody have a similar issue? If you store XML files in HDFS, how can
> you make sure that a chunk read by a mapper does not contain partial data of
> an XML segment?
>
> For example:
>
> <title>
> <book>book1</book>
> <author>me</author>
> ..............what if this is the boundary of a chunk?...................
> <year>2009</year>
> <book>book2</book>
>
> <author>me</author>
>
> <year>2009</year>
> <book>book3</book>
>
> <author>me</author>
>
> <year>2009</year>
> <title>
>
>
>
>


-- 


Amandeep Khurana
Computer Science Graduate Student
University of California, Santa Cruz



      

Re: What if an XML file is across the boundary of HDFS chunks?

Posted by Amandeep Khurana <am...@gmail.com>.
Store the entire xml in one line...

On 10/29/09, Steve Gao <st...@yahoo.com> wrote:
> Does anybody have a similar issue? If you store XML files in HDFS, how can
> you make sure that a chunk read by a mapper does not contain partial data of
> an XML segment?
>
> For example:
>
> <title>
> <book>book1</book>
> <author>me</author>
> ..............what if this is the boundary of a chunk?...................
> <year>2009</year>
> <book>book2</book>
>
> <author>me</author>
>
> <year>2009</year>
> <book>book3</book>
>
> <author>me</author>
>
> <year>2009</year>
> <title>
>
>
>
>


-- 


Amandeep Khurana
Computer Science Graduate Student
University of California, Santa Cruz

What if an XML file is across the boundary of HDFS chunks?

Posted by Steve Gao <st...@yahoo.com>.
Does anybody have a similar issue? If you store XML files in HDFS, how can you make sure that a chunk read by a mapper does not contain partial data of an XML segment?

For example:

<title>
<book>book1</book>
<author>me</author>
..............what if this is the boundary of a chunk?...................
<year>2009</year>
<book>book2</book>

<author>me</author>

<year>2009</year>
<book>book3</book>

<author>me</author>

<year>2009</year>
<title>