Posted to user@hadoop.apache.org by Yaron Gonen <ya...@gmail.com> on 2012/09/11 14:41:26 UTC

Some general questions about DBInputFormat

Hi,
After reviewing the class's (not very complicated) code, I have some
questions I hope someone can answer:

   - (more general question) Are there many use-cases for using
   DBInputFormat? Do most Hadoop jobs take their input from files or DBs?
   - What happens when the database is updated during the mappers' data
   retrieval phase? Is there a way to lock the database before the data
   retrieval phase and release it afterwards?
   - Since all mappers open a connection to the same DBMS, one cannot use
   hundreds of mappers. Is there a solution to this problem?

Thanks,
Yaron

Unsubscribe

Posted by Kunaal <ku...@gmail.com>.

Re: Some general questions about DBInputFormat

Posted by Yaron Gonen <ya...@gmail.com>.
Hi again Nick,
DBInputFormat does use Connection.TRANSACTION_SERIALIZABLE, but this is a
per-connection attribute. Since every mapper has its own connection, and every
connection is opened at a different time, each connection sees a different
snapshot of the DB. This can cause, for example, two mappers to process the
same record (if an insert was performed in between).
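This point can be illustrated without a database at all. The sketch below is plain Java, with a List standing in for the table and subList standing in for each mapper's LIMIT/OFFSET window; the row values and split sizes are made up for illustration. It shows how two split queries issued at different times can overlap once a row is inserted between them:

```java
import java.util.ArrayList;
import java.util.List;

public class SnapshotSkew {
    // Each "mapper" reads its own LIMIT/OFFSET window over the table as it
    // exists at the moment its query runs (its own connection's snapshot).
    static List<Integer> readSplit(List<Integer> table, int offset, int limit) {
        int from = Math.min(offset, table.size());
        int to = Math.min(offset + limit, table.size());
        return new ArrayList<>(table.subList(from, to));
    }

    public static void main(String[] args) {
        List<Integer> table = new ArrayList<>(List.of(10, 20, 30, 40)); // 4 rows, 2 splits of 2
        List<Integer> split0 = readSplit(table, 0, 2); // mapper 0 runs first
        table.add(1, 15);                              // concurrent INSERT between the two reads
        List<Integer> split1 = readSplit(table, 2, 2); // mapper 1 runs later
        System.out.println(split0); // [10, 20]
        System.out.println(split1); // [20, 30] -> row 20 is read twice, row 40 is never read
    }
}
```

Since each connection's serializable transaction only guards against writes overlapping that one transaction, serializable isolation per mapper does not make the set of windows mutually consistent.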

On Wed, Sep 12, 2012 at 12:35 AM, Nick Jones <ni...@amd.com> wrote:

>  Hi Yaron,
>
> I haven't looked at/used it in a while, but I seem to remember that each
> mapper's SQL request was wrapped in a transaction to prevent the number of
> rows changing.  DBInputFormat uses Connection.TRANSACTION_SERIALIZABLE from
> java.sql.Connection to prevent changes in the number of rows selected from
> a where clause.
>
> The locking behavior I observed may have also been related to how MySQL
> was set up at the time.
>
>
> On 09/11/2012 09:25 AM, Yaron Gonen wrote:
>
> Thanks for the fast response.
> Nick, regarding locking a table: as far as I understood from the code,
> each mapper opens its own connection to the DB. I didn't see any code such
> that the job creates a transaction and passes it to the mapper. Did I
> miss something?
> again, thanks!
>
>
> On Tue, Sep 11, 2012 at 4:00 PM, Nick Jones <ni...@amd.com> wrote:
>
>> Hi Yaron
>>
>> Replies inline below.
>>
>>
>> On 09/11/2012 07:41 AM, Yaron Gonen wrote:
>>
>>>  Hi,
>>> After reviewing the class's (not very complicated) code, I have some
>>> questions I hope someone can answer:
>>>
>>>    * (more general question) Are there many use-cases for using
>>>
>>>     DBInputFormat? Do most Hadoop jobs take their input from files or
>>> DBs?
>>>
>>>  Bejoy's right, most jobs utilize data across HDFS or some other
>> distributed architecture to feed M/R at a sufficient rate. DBInputFormat
>> could be helpful in pulling pointers to other sources of data (e.g. file
>> paths for filers where actual binary content is stored).
>>
>>>
>>>   * What happens when the database is updated during mappers' data
>>>
>>>     retrieval phase? Is there a way to lock the database before the
>>>     data retrieval phase and release it afterwards?
>>>
>>>  The whole job creates a transaction against the RDBMS that ensures
>> consistent state throughout the job.  Depending on the source and settings,
>> this might entirely lock a table or lock the selected rows by the query.
>>
>>>
>>>   * Since all mappers open a connection to the same DBMS, one cannot
>>>
>>>     use hundreds of mappers. Is there a solution to this problem?
>>>
>>>  Depends on the connection limits and the number of rows requested.
>> I've found that the server suffered other problems first before connection
>> count limitations.
>>
>>>
>>> Thanks,
>>> Yaron
>>>
>>
>>
>>
>
>

Re: Some general questions about DBInputFormat

Posted by Nick Jones <ni...@amd.com>.
Hi Yaron,

I haven't looked at/used it in a while, but I seem to remember that each
mapper's SQL request was wrapped in a transaction to prevent the number 
of rows changing.  DBInputFormat uses 
Connection.TRANSACTION_SERIALIZABLE from java.sql.Connection to prevent 
changes in the number of rows selected from a where clause.

The locking behavior I observed may have also been related to how MySQL
was set up at the time.
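For reference, the isolation level mentioned above is set on each individual java.sql.Connection, which is exactly why it cannot provide cross-mapper consistency. A minimal sketch (the helper method name is mine, not DBInputFormat's):

```java
import java.sql.Connection;
import java.sql.SQLException;

public class IsolationPerConnection {
    // Isolation is an attribute of one Connection, not of the database:
    // each mapper calling this on its own connection still gets its own
    // independently started serializable transaction.
    static void makeSerializable(Connection conn) throws SQLException {
        conn.setTransactionIsolation(Connection.TRANSACTION_SERIALIZABLE);
    }

    public static void main(String[] args) {
        // The JDBC constant involved; defined in java.sql.Connection.
        System.out.println(Connection.TRANSACTION_SERIALIZABLE);
    }
}
```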

On 09/11/2012 09:25 AM, Yaron Gonen wrote:
> Thanks for the fast response.
> Nick, regarding locking a table: as far as I understood from the code, 
> each mapper opens its own connection to the DB. I didn't see any code 
> such that the job creates a transaction and passes it to the mapper. 
> Did I miss something?
> again, thanks!
>
>
> On Tue, Sep 11, 2012 at 4:00 PM, Nick Jones <nick.jones@amd.com 
> <ma...@amd.com>> wrote:
>
>     Hi Yaron
>
>     Replies inline below.
>
>
>     On 09/11/2012 07:41 AM, Yaron Gonen wrote:
>
>         Hi,
>         After reviewing the class's (not very complicated) code, I
>         have some questions I hope someone can answer:
>
>           * (more general question) Are there many use-cases for using
>
>             DBInputFormat? Do most Hadoop jobs take their input from
>         files or DBs?
>
>     Bejoy's right, most jobs utilize data across HDFS or some other
>     distributed architecture to feed M/R at a sufficient rate.
>     DBInputFormat could be helpful in pulling pointers to other
>     sources of data (e.g. file paths for filers where actual binary
>     content is stored).
>
>
>           * What happens when the database is updated during mappers'
>         data
>
>             retrieval phase? Is there a way to lock the database
>         before the
>             data retrieval phase and release it afterwards?
>
>     The whole job creates a transaction against the RDBMS that ensures
>     consistent state throughout the job.  Depending on the source and
>     settings, this might entirely lock a table or lock the selected
>     rows by the query.
>
>
>           * Since all mappers open a connection to the same DBMS, one
>         cannot
>
>             use hundreds of mappers. Is there a solution to this problem?
>
>     Depends on the connection limits and the number of rows requested.
>     I've found that the server suffered other problems first before
>     connection count limitations.
>
>
>         Thanks,
>         Yaron
>
>
>
>


Re: Some general questions about DBInputFormat

Posted by Yaron Gonen <ya...@gmail.com>.
Thanks for the fast response.
Nick, regarding locking a table: as far as I understood from the code, each
mapper opens its own connection to the DB. I didn't see any code where
the job creates a transaction and passes it to the mappers. Did I
miss something?
Again, thanks!


On Tue, Sep 11, 2012 at 4:00 PM, Nick Jones <ni...@amd.com> wrote:

> Hi Yaron
>
> Replies inline below.
>
>
> On 09/11/2012 07:41 AM, Yaron Gonen wrote:
>
>> Hi,
>> After reviewing the class's (not very complicated) code, I have some
>> questions I hope someone can answer:
>>
>>   * (more general question) Are there many use-cases for using
>>
>>     DBInputFormat? Do most Hadoop jobs take their input from files or DBs?
>>
>>  Bejoy's right, most jobs utilize data across HDFS or some other
> distributed architecture to feed M/R at a sufficient rate. DBInputFormat
> could be helpful in pulling pointers to other sources of data (e.g. file
> paths for filers where actual binary content is stored).
>
>>
>>   * What happens when the database is updated during mappers' data
>>
>>     retrieval phase? Is there a way to lock the database before the
>>     data retrieval phase and release it afterwards?
>>
>>  The whole job creates a transaction against the RDBMS that ensures
> consistent state throughout the job.  Depending on the source and settings,
> this might entirely lock a table or lock the selected rows by the query.
>
>>
>>   * Since all mappers open a connection to the same DBMS, one cannot
>>
>>     use hundreds of mappers. Is there a solution to this problem?
>>
>>  Depends on the connection limits and the number of rows requested. I've
> found that the server suffered other problems first before connection count
> limitations.
>
>>
>> Thanks,
>> Yaron
>>
>
>
>

Re: Some general questions about DBInputFormat

Posted by Nick Jones <ni...@amd.com>.
Hi Yaron,

Replies inline below.

On 09/11/2012 07:41 AM, Yaron Gonen wrote:
> Hi,
> After reviewing the class's (not very complicated) code, I have some 
> questions I hope someone can answer:
>
>   * (more general question) Are there many use-cases for using
>     DBInputFormat? Do most Hadoop jobs take their input from files or DBs?
>
Bejoy's right, most jobs utilize data across HDFS or some other 
distributed architecture to feed M/R at a sufficient rate. DBInputFormat 
could be helpful in pulling pointers to other sources of data (e.g. file 
paths for filers where actual binary content is stored).
>
>   * What happens when the database is updated during mappers' data
>     retrieval phase? Is there a way to lock the database before the
>     data retrieval phase and release it afterwards?
>
The whole job creates a transaction against the RDBMS that ensures 
consistent state throughout the job.  Depending on the source and 
settings, this might entirely lock a table or lock the selected rows by 
the query.
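To make the mechanics concrete: conceptually, each map task runs its own bounded query over the table. The sketch below builds LIMIT/OFFSET-style per-split queries; it illustrates the general approach rather than DBInputFormat's exact code, and the table and column names are made up:

```java
public class SplitQuerySketch {
    // Divide totalRows into numSplits contiguous windows; the last split
    // absorbs the remainder. Each mapper runs only its own window's query.
    static String queryForSplit(String baseQuery, long totalRows, int numSplits, int split) {
        long chunk = totalRows / numSplits;
        long offset = split * chunk;
        long limit = (split == numSplits - 1) ? totalRows - offset : chunk;
        return baseQuery + " LIMIT " + limit + " OFFSET " + offset;
    }

    public static void main(String[] args) {
        String base = "SELECT id, name FROM employees ORDER BY id";
        for (int i = 0; i < 3; i++) {
            System.out.println(queryForSplit(base, 10, 3, i));
        }
        // Windows: LIMIT 3 OFFSET 0, LIMIT 3 OFFSET 3, LIMIT 4 OFFSET 6.
        // Each window's query runs in its own connection and transaction,
        // so the windows are only mutually consistent if the table is static.
    }
}
```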
>
>   * Since all mappers open a connection to the same DBMS, one cannot
>     use hundreds of mappers. Is there a solution to this problem?
>
Depends on the connection limits and the number of rows requested. I've 
found that the server suffered other problems first before connection 
count limitations.
>
> Thanks,
> Yaron



Re: Some general questions about DBInputFormat

Posted by Bejoy KS <be...@gmail.com>.
Hi Yaron

Sqoop uses a similar implementation. You can get some details there.

Replies inline
• (more general question) Are there many use-cases for using DBInputFormat? Do most Hadoop jobs take their input from files or DBs?

> From my small experience, most MR jobs have data in HDFS. DBInputFormat is useful for getting data out of an RDBMS into Hadoop; the Sqoop implementation is an example.


• Since all mappers open a connection to the same DBMS, one cannot use hundreds of mappers. Is there a solution to this problem?

> The number of mappers shouldn't exceed the number of connections permitted by that DB.
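
One way to honor that limit is to cap the split count at the DB's
connection ceiling when planning the job. A rough sketch (plain Python;
the connection-cap argument is my addition for illustration, while the
carving of a row count into contiguous chunks with the remainder in the
last split follows what DBInputFormat.getSplits does):

```python
def plan_splits(total_rows, requested_mappers, max_db_connections):
    """Cap the mapper count by the DB's connection limit, then carve the
    row range into contiguous (offset, length) splits; the final split
    absorbs the remainder rows."""
    n = min(requested_mappers, max_db_connections)
    chunk = total_rows // n
    splits = []
    for i in range(n):
        start = i * chunk
        length = chunk if i < n - 1 else total_rows - start
        splits.append((start, length))
    return splits

# 400 mappers requested, but the DB only allows 100 connections.
print(plan_splits(1000, 400, 100)[:2])  # [(0, 10), (10, 10)]
# Uneven division: the last split picks up the extra rows.
print(plan_splits(1003, 4, 100))
```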



Regards
Bejoy KS

Sent from handheld, please excuse typos.

-----Original Message-----
From: Yaron Gonen <ya...@gmail.com>
Date: Tue, 11 Sep 2012 15:41:26 
To: <us...@hadoop.apache.org>
Reply-To: user@hadoop.apache.org
Subject: Some general questions about DBInputFormat

Hi,
After reviewing the class's (not very complicated) code, I have some
questions I hope someone can answer:

   - (more general question) Are there many use-cases for using
   DBInputFormat? Do most Hadoop jobs take their input from files or DBs?
   - What happens when the database is updated during mappers' data
   retrieval phase? Is there a way to lock the database before the data
   retrieval phase and release it afterwards?
   - Since all mappers open a connection to the same DBS, one cannot use
   hundreds of mappers. Is there a solution to this problem?

Thanks,
Yaron

