Posted to mapreduce-user@hadoop.apache.org by Joan <jo...@gmail.com> on 2011/01/03 17:56:57 UTC

How to split DBInputFormat?

Hi,

I'm trying to load data from a big table in a database. I'm using DBInputFormat,
but when my job tries to get all records, it throws an exception:

*Exception in thread "Thread for syncLogs" java.lang.OutOfMemoryError: Java
heap space*

I'm trying to get millions of records and I would like to use DBInputSplit,
but I don't know how to use it or how many splits I need.

Thanks

Joan
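
For context on the "how many splits" part of the question: the 0.20 mapred
DBInputFormat runs a COUNT query and divides the rows evenly across the job's
configured number of map tasks, with the last split absorbing the remainder.
The following is a simplified, self-contained sketch of that arithmetic, not
the actual Hadoop source:

```java
// Sketch of DBInputFormat-style splitting: divide a row count into
// (start, length) chunks, one per map task. Simplified illustration only;
// see getSplits() in org.apache.hadoop.mapred.lib.db.DBInputFormat.
public class DbSplitSketch {

    // Returns {start, length} pairs covering rows [0, rowCount).
    static long[][] computeSplits(long rowCount, int numMapTasks) {
        long[][] splits = new long[numMapTasks][2];
        long chunkSize = rowCount / numMapTasks;
        for (int i = 0; i < numMapTasks; i++) {
            long start = i * chunkSize;
            // The last split takes whatever rows are left over.
            long length = (i == numMapTasks - 1) ? rowCount - start : chunkSize;
            splits[i][0] = start;
            splits[i][1] = length;
        }
        return splits;
    }

    public static void main(String[] args) {
        // 10 rows over 3 map tasks: lengths 3, 3, 4
        for (long[] s : computeSplits(10, 3)) {
            System.out.println("start=" + s[0] + " length=" + s[1]);
        }
    }
}
```

Each split then becomes a LIMIT/OFFSET query against the table, which is why
a single huge split can blow the heap while many smaller ones do not.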

Re: How to split DBInputFormat?

Posted by Joan <jo...@gmail.com>.
Thanks,

I've increased the number of map tasks and reduce tasks. Although it works,
I don't think that's a real solution, so I will try both proposals.

Joan
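
For anyone finding this thread later: with the 0.20 mapred API the split count
follows the configured number of map tasks, so a driver along these lines is
roughly what's involved. This is a configuration sketch, not a complete job:
MyRecord, the JDBC driver/URL, the table, and the field names below are all
placeholders, and MyRecord would have to implement Writable and DBWritable.

```java
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.lib.db.DBConfiguration;
import org.apache.hadoop.mapred.lib.db.DBInputFormat;

public class DbImportDriver {
    public static void main(String[] args) throws Exception {
        JobConf job = new JobConf(DbImportDriver.class);
        job.setInputFormat(DBInputFormat.class);
        // Connection settings: JDBC driver class, URL, user, password
        // (all placeholder values here).
        DBConfiguration.configureDB(job, "com.mysql.jdbc.Driver",
            "jdbc:mysql://dbhost/mydb", "user", "password");
        // MyRecord is a placeholder DBWritable; "mytable", "id" and "name"
        // are placeholder table/field names. No WHERE conditions, order by id.
        DBInputFormat.setInput(job, MyRecord.class, "mytable",
            null, "id", "id", "name");
        // DBInputFormat creates one split per configured map task, so this
        // controls how many chunks the table is read in.
        job.setNumMapTasks(8);
        // ... set mapper, output format, etc., then:
        JobClient.runJob(job);
    }
}
```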

2011/1/4 Hari Sreekumar <hs...@clickable.com>

> Arvind,
>
> Where can I find DataDrivenInputFormat? Is it available in v0.20.2 and is
> it stable?
>
> Thanks,
> Hari

Re: How to split DBInputFormat?

Posted by Hari Sreekumar <hs...@clickable.com>.
Thanks Sonal. I'll look into this tool as well.

hari

On Tue, Jan 4, 2011 at 3:57 PM, Sonal Goyal <so...@gmail.com> wrote:

> Hi Hari,
>
> I don't think DataDrivenDBInputFormat is available in 0.20.x; it's only
> available in the 0.21 versions. You can check the hihoApache0.20 branch at
> https://github.com/sonalgoyal/hiho/, which backports the relevant db
> formats for Apache Hadoop 0.20 versions.
>
>
> Thanks and Regards,
> Sonal

Re: How to split DBInputFormat?

Posted by Sonal Goyal <so...@gmail.com>.
Hi Hari,

I don't think DataDrivenDBInputFormat is available in 0.20.x; it's only
available in the 0.21 versions. You can check the hihoApache0.20 branch at
https://github.com/sonalgoyal/hiho/, which backports the relevant db formats
for Apache Hadoop 0.20 versions.

Thanks and Regards,
Sonal
<https://github.com/sonalgoyal/hiho> Connect Hadoop with databases,
Salesforce, FTP servers and others <https://github.com/sonalgoyal/hiho>
Nube Technologies <http://www.nubetech.co>

<http://in.linkedin.com/in/sonalgoyal>

On Tue, Jan 4, 2011 at 10:36 AM, Hari Sreekumar <hs...@clickable.com> wrote:

> Arvind,
>
> Where can I find DataDrivenInputFormat? Is it available in v0.20.2 and is
> it stable?
>
> Thanks,
> Hari
>

Re: How to split DBInputFormat?

Posted by Hari Sreekumar <hs...@clickable.com>.
Arvind,

Where can I find DataDrivenInputFormat? Is it available in v0.20.2 and is it
stable?

Thanks,
Hari

On Tue, Jan 4, 2011 at 12:02 AM, arvind@cloudera.com <ar...@cloudera.com> wrote:

> Joan,
>
> The DataDrivenInputFormat is a better fit for moving large volumes of data
> as it generates WHERE clauses that help partition the data better.
>
> You could also use Sqoop <https://github.com/cloudera/sqoop> that makes
> such large volume data migration between relational sources and HDFS a
> breeze.
>
> Arvind
>

Re: How to split DBInputFormat?

Posted by "arvind@cloudera.com" <ar...@cloudera.com>.
Joan,

The DataDrivenInputFormat is a better fit for moving large volumes of data,
as it generates WHERE clauses that help partition the data better.

You could also use Sqoop <https://github.com/cloudera/sqoop>, which makes
such large-volume data migration between relational sources and HDFS a breeze.

Arvind
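
To illustrate the WHERE-clause idea: for an integer split column, the range
between the column's MIN and MAX is cut into roughly equal pieces, and each
split gets its own bounding clause. The sketch below is a simplified
illustration of that scheme, not the actual Hadoop implementation (which also
handles text, date, and floating-point columns):

```java
import java.util.ArrayList;
import java.util.List;

// Simplified sketch of DataDrivenDBInputFormat-style integer splitting:
// divide [min, max] into numSplits half-open ranges and emit one
// WHERE clause per split.
public class WhereClauseSplitSketch {

    static List<String> splitClauses(String col, long min, long max, int numSplits) {
        List<String> clauses = new ArrayList<String>();
        long span = max - min + 1;
        long step = span / numSplits;
        long lo = min;
        for (int i = 0; i < numSplits; i++) {
            // The last split runs through max; earlier ones are half-open.
            long hi = (i == numSplits - 1) ? max + 1 : lo + step;
            clauses.add(col + " >= " + lo + " AND " + col + " < " + hi);
            lo = hi;
        }
        return clauses;
    }

    public static void main(String[] args) {
        // ids 1..100 over 4 splits -> four non-overlapping range clauses
        for (String c : splitClauses("id", 1, 100, 4)) {
            System.out.println(c);
        }
    }
}
```

Because each mapper's query selects a disjoint key range instead of paging
with LIMIT/OFFSET, no single task has to materialize the whole table.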

On Mon, Jan 3, 2011 at 8:56 AM, Joan <jo...@gmail.com> wrote:

> Hi,
>
> I'm trying to load data from a big table in a database. I'm using DBInputFormat,
> but when my job tries to get all records, it throws an exception:
>
> *Exception in thread "Thread for syncLogs" java.lang.OutOfMemoryError:
> Java heap space*
>
> I'm trying to get millions of records and I would like to use DBInputSplit,
> but I don't know how to use it or how many splits I need.
>
> Thanks
>
> Joan
>

Re: How to split DBInputFormat?

Posted by Sonal Goyal <so...@gmail.com>.
Hi Joan,

To get data from the database, you can check the open source framework HIHO
at https://github.com/sonalgoyal/hiho/

If you provide the details of your database and the table to import as
configuration values, the split will happen automatically for you. Please
feel free to write to me directly in case you see any issues.

Thanks and Regards,
Sonal
<https://github.com/sonalgoyal/hiho> Connect Hadoop with databases,
Salesforce, FTP servers and others <https://github.com/sonalgoyal/hiho>
Nube Technologies <http://www.nubetech.co>

<http://in.linkedin.com/in/sonalgoyal>

On Mon, Jan 3, 2011 at 10:26 PM, Joan <jo...@gmail.com> wrote:

> Hi,
>
> I'm trying to load data from a big table in a database. I'm using DBInputFormat,
> but when my job tries to get all records, it throws an exception:
>
> *Exception in thread "Thread for syncLogs" java.lang.OutOfMemoryError:
> Java heap space*
>
> I'm trying to get millions of records and I would like to use DBInputSplit,
> but I don't know how to use it or how many splits I need.
>
> Thanks
>
> Joan
>