Posted to common-user@hadoop.apache.org by "David J. O'Dell" <do...@videoegg.com> on 2008/07/02 19:51:57 UTC
hadoop in the ETL process
Is anyone using hadoop for any part of the ETL process?
Given its ability to process large amounts of log files this seems like
a good fit.
--
David O'Dell
Director, Operations
e: dodell@videoegg.com
t: (415) 738-5152
180 Townsend St., Third Floor
San Francisco, CA 94107
Re: hadoop in the ETL process
Posted by Andreas Kostyrka <an...@kostyrka.org>.
On Wednesday 02 July 2008 19:51:57 David J. O'Dell wrote:
> Is anyone using hadoop for any part of the ETL process?
>
> Given its ability to process large amounts of log files this seems like
> a good fit.
Well, we are doing the following data flow:
1.) webservers upload to S3
2.) hadoop jobs get started with a number of logfiles each. We use
streaming.jar only, with a Python "framework" and a number of driver scripts
for mapping, reducing (which is usually a completely generic behaviour
assigned on a per job basis, e.g. FirstOnly, SumValues, CollectSet), and
later on applying to MySQL.
3.) the results get written to MySQL.
4.) inside the hadoop cluster certain data from MySQL that is needed for
efficient reducing (you cannot count persons by sex, if you do not know the
sex of the person), is available as a REST-style http service. Each node has
its own squid, the http services create as much cacheable content as
possible, and the squids do ICP peering against all nodes.
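As a rough illustration of the generic reducing behaviours named in step 2
(FirstOnly, SumValues, CollectSet), here is a minimal Python sketch of a
streaming-style reducer. It assumes the usual streaming convention of sorted
"key<TAB>value" lines on stdin. The function names, the REDUCERS registry and
the comma-joined CollectSet output are hypothetical choices made for this
example, not the poster's actual framework.

```python
from itertools import groupby


def parse(lines):
    """Yield (key, value) pairs from tab-separated streaming lines."""
    for line in lines:
        key, _, value = line.rstrip("\n").partition("\t")
        yield key, value


def first_only(values):
    """Keep only the first value seen for a key."""
    return next(iter(values))


def sum_values(values):
    """Add up the values for a key, assuming they are integers."""
    return sum(int(v) for v in values)


def collect_set(values):
    """Collect the distinct values for a key (sorted, comma-joined)."""
    return ",".join(sorted(set(values)))


# Hypothetical registry: a job picks its behaviour by name.
REDUCERS = {
    "FirstOnly": first_only,
    "SumValues": sum_values,
    "CollectSet": collect_set,
}


def reduce_stream(lines, behaviour):
    """Apply one generic behaviour to each run of equal keys.

    The input must already be sorted by key, which is what the
    framework hands to a streaming reducer.
    """
    reducer = REDUCERS[behaviour]
    for key, group in groupby(parse(lines), key=lambda kv: kv[0]):
        yield "%s\t%s" % (key, reducer(v for _, v in group))
```

A driver script would then do something like
"for out in reduce_stream(sys.stdin, 'SumValues'): print(out)" and be passed
to streaming.jar as the reducer command.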
It works somehow fine, although from time to time there are problems, e.g. my
current one is that hadoop behaves really badly on long lines. (I know it's not
exactly a trivial thing to read an arbitrarily long line without knowing a
limit beforehand, OTOH, Python does manage that for me, without me especially
losing too much sleep about it. Another of these situations where slow
highlevel languages overwhelm the lowlevel optimization champions.)
Andreas
RE: hadoop in the ETL process
Posted by Ryan Lynch <rl...@veoh.com>.
Hadoop is a great way to offload ETL jobs (especially aggregation) out
of the DB. More than likely you would want to use Hadoop as a mechanism
to create a file you can load into the database as a batch job (data
pump or sql loader with Oracle for example) outside of Hadoop entirely.
I would imagine establishing connections inside map/reduce jobs would
not be ideal.
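A minimal Python sketch of the "create a file, load it as a batch" idea:
the job output is turned into a tab-delimited file that a bulk loader can
read outside of Hadoop. This assumes MySQL-style LOAD DATA defaults (tab
between fields, newline between rows, backslash escapes, \N for NULL);
other loaders such as Oracle's would need their own format. The function
names escape_field and write_load_file are hypothetical.

```python
import io


def escape_field(value):
    """Render one field for a tab-delimited bulk-load file.

    Assumes MySQL LOAD DATA defaults: backslash is the escape
    character and a bare \\N marks NULL.
    """
    if value is None:
        return r"\N"
    text = str(value)
    # Escape the backslash first so later escapes are not doubled.
    return (text.replace("\\", "\\\\")
                .replace("\t", "\\t")
                .replace("\n", "\\n"))


def write_load_file(rows, out):
    """Write rows (iterables of fields) to out; return the row count."""
    count = 0
    for row in rows:
        out.write("\t".join(escape_field(f) for f in row) + "\n")
        count += 1
    return count


# Example with an in-memory buffer standing in for the output file.
buf = io.StringIO()
written = write_load_file([("a", 3), ("b\tc", None)], buf)
```

The resulting file can be handed to the database's bulk loader as a separate
batch step, so no database connections are opened inside map or reduce tasks.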
Regards,
Ryan
-----Original Message-----
From: Chris K Wensel [mailto:chris@wensel.net]
Sent: Wednesday, July 02, 2008 11:31 AM
To: core-user@hadoop.apache.org
Subject: Re: hadoop in the ETL process
If you're referring to loading an RDBMS with data on Hadoop, this is
doable, but you will need to write your own JDBC adapters to your
tables.
But you might review what you are using the RDBMS for and see if those
jobs would be better off running on Hadoop entirely, if not for most
of the processing.
ckw
On Jul 2, 2008, at 10:51 AM, David J. O'Dell wrote:
> Is anyone using hadoop for any part of the ETL process?
>
> Given its ability to process large amounts of log files this seems like
> a good fit.
>
> --
> David O'Dell
> Director, Operations
> e: dodell@videoegg.com
> t: (415) 738-5152
> 180 Townsend St., Third Floor
> San Francisco, CA 94107
>
--
Chris K Wensel
chris@wensel.net
http://chris.wensel.net/
http://www.cascading.org/
Re: hadoop in the ETL process
Posted by Chris K Wensel <ch...@wensel.net>.
If you're referring to loading an RDBMS with data on Hadoop, this is
doable, but you will need to write your own JDBC adapters to your
tables.
But you might review what you are using the RDBMS for and see if those
jobs would be better off running on Hadoop entirely, if not for most
of the processing.
ckw
On Jul 2, 2008, at 10:51 AM, David J. O'Dell wrote:
> Is anyone using hadoop for any part of the ETL process?
>
> Given its ability to process large amounts of log files this seems like
> a good fit.
>
> --
> David O'Dell
> Director, Operations
> e: dodell@videoegg.com
> t: (415) 738-5152
> 180 Townsend St., Third Floor
> San Francisco, CA 94107
>
--
Chris K Wensel
chris@wensel.net
http://chris.wensel.net/
http://www.cascading.org/