Posted to users@kafka.apache.org by Adrian Woodhead <aw...@hotels.com> on 2015/10/22 12:56:06 UTC

future of Camus?

Hello all,

We're looking at options for getting data from Kafka onto HDFS, and Camus looks like the natural choice for this. It's also evident that LinkedIn, who originally created Camus, are taking things in a different direction and are advising people to use their Gobblin ETL framework instead. We feel that Gobblin is overkill for many simple use cases and that Camus is a much simpler and better fit. The problem now is that, with LinkedIn apparently withdrawing official support for it, changes to Camus appear to be managed in various forks, and it looks like everyone is building and using their own versions. Wouldn't it be better for a community to form around one official fork so that development efforts can be focused there? Any thoughts on this?

Thanks,

Adrian

Re: future of Camus?

Posted by Todd Snyder <ts...@blackberry.com>.

Another alternative is to check out KaBoom:

      https://github.com/blackberry/KaBoom

It uses a pared-down Kafka consumer library to pull data from Kafka and write it to defined (and somewhat dynamic) HDFS paths in a custom (and changeable) Avro schema we call boom. It uses Kerberos for authentication and supports very high throughput.

It's still actively being developed, with a new release coming soon that adds enhanced configuration through a new REST API (kontroller).

Cheers

Todd.

Sent from my BlackBerry 10 smartphone on the TELUS network.
  Original Message
From: Guozhang Wang
Sent: Thursday, October 22, 2015 5:03 PM
To: users@kafka.apache.org
Reply To: users@kafka.apache.org
Subject: Re: future of Camus?

Hi Adrian,

Another alternative approach is to use Kafka's own Copycat framework for
data ingress / egress. It will be released in our 0.9.0 version,
expected in Nov.

Under Copycat, users can write different "connectors" instantiated for
different source / sink systems, and for your case there is a built-in
HDFS connector coming along with the framework itself. You can find more
details in these Kafka wikis / Java docs:

https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=58851767

https://s3-us-west-2.amazonaws.com/confluent-files/copycat-docs-wip/intro.html

Guozhang

On Thu, Oct 22, 2015 at 12:52 PM, Henry Cai <hc...@pinterest.com.invalid>
wrote:

> Take a look at secor:
>
> https://github.com/pinterest/secor
>
> Secor is a no-frills Kafka->HDFS ingestion tool. It doesn't depend on any
> underlying systems such as Hadoop; it only uses the Kafka high-level consumer
> to balance the workload.  Very easy to understand and manage.  It's
> probably the 2nd most popular Kafka/HDFS ingestion tool (behind Camus).
> Lots of web companies use it for Kafka data ingestion
> (Pinterest/Uber/Airbnb).

--
-- Guozhang

Re: future of Camus?

Posted by Adrian Woodhead <aw...@hotels.com>.

Thanks everyone for your input on this thread, looks like a hot topic ;)

I thought I'd reply to everyone's feedback in one go rather than have lots of separate replies, so here goes...

Henry - thanks for pointing out Secor, I had never seen it before. I can see why not having a Hadoop dependency can be appealing, but in our case we actually like the dependency: for Camus it means we can scale the job out on the cluster without having to do anything extra ourselves. The documentation also makes it look like Secor is very S3-centric, while we're interested in HDFS.

Guozhang - Copycat certainly looks very promising, and again I'd never come across it. An HDFS export connector that runs on YARN would probably be what we're looking for: it could potentially do what Camus does, and being more tightly integrated with Kafka should mean it's less likely to be orphaned. We'll certainly keep an eye on this, although it looks like it's probably not production ready yet? It also wasn't immediately clear how one would run it on YARN. Our jobs are typically started on lightweight machines with limited resources, so we want to delegate as much as possible to the cluster nodes for parallelising the work, with as little setup on our part as we can get away with.

Todd - we looked at KaBoom, but we don't use Avro and we need to control the formats of the files we create on HDFS (typically ORC and SequenceFile), along with wanting full control over the HDFS paths where the files are created. Camus has extension points that allowed us to write our own RecordWriterProvider, Partitioner and MessageDecoder, all of which we use and none of which looked possible in KaBoom as it currently stands. Apologies if we've overlooked something here.
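
For readers unfamiliar with these extension points, the division of labour can be sketched roughly as below. This is plain Python with hypothetical names (`decode_message`, `partition_path`, `write_records`, the `/data/...` layout and the JSON message format are all assumptions); it does not reproduce Camus's actual Java interfaces, only the idea that decoding, path selection and file writing are separately pluggable.

```python
import json
from collections import defaultdict
from datetime import datetime, timezone


def decode_message(raw: bytes) -> dict:
    """Decoder role: turn raw message bytes into a record (JSON is assumed here)."""
    return json.loads(raw.decode("utf-8"))


def partition_path(topic: str, record: dict) -> str:
    """Partitioner role: pick an hourly output directory from the record's timestamp."""
    ts = datetime.fromtimestamp(record["timestamp"], tz=timezone.utc)
    return f"/data/{topic}/{ts:%Y/%m/%d/%H}"


def write_records(records: list) -> str:
    """Writer role: serialise one partition's records (newline-delimited JSON here)."""
    return "\n".join(json.dumps(r, sort_keys=True) for r in records)


def run(topic: str, raw_messages: list) -> dict:
    """Decode each message, group records by output path, then serialise each group."""
    grouped = defaultdict(list)
    for raw in raw_messages:
        record = decode_message(raw)
        grouped[partition_path(topic, record)].append(record)
    return {path: write_records(recs) for path, recs in grouped.items()}


messages = [
    b'{"timestamp": 1445515200, "event": "search"}',   # 2015-10-22 12:00:00 UTC
    b'{"timestamp": 1445515260, "event": "click"}',    # 2015-10-22 12:01:00 UTC
    b'{"timestamp": 1445518800, "event": "booking"}',  # 2015-10-22 13:00:00 UTC
]
output = run("clicks", messages)
```

Swapping `write_records` for an ORC or SequenceFile writer, or `partition_path` for a different directory scheme, would leave the other two pieces untouched, which is the kind of control described above.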

Vivek - we also considered Flume/Flafka, but we're actually trying to reduce the number of technologies we're using. Part of the reason for us using Kafka is to have *one* standard mechanism for getting data in and out of Hadoop, and the intention is for this to replace our existing Flume infrastructure. I appreciate that Flume can do the job, but in terms of operational complexity we'd prefer to have fewer moving parts, and we felt Camus was less complex than adding Flume to the end of the data pipeline.

So it sounds like Camus still has features that can't easily be replicated in any of the other solutions as they currently stand. It also appears that nobody here is keen on working on an official fork of Camus, possibly since they're using or working on the alternatives above? I made a similar post on the "Camus_etl" group (https://groups.google.com/forum/#!topic/camus_etl/jUkX4zC4oF0) and some parties there indicated that they would be interested in an official Camus fork, or some other way of keeping the current Camus codebase alive with new features being added going forward, so we'll see where that goes.

If anyone has any other opinions or thoughts please let me know. 

Thanks,

Adrian

________________________________________
From: vivek thakre <vi...@gmail.com>
Sent: 22 October 2015 23:44
To: users@kafka.apache.org
Subject: Re: future of Camus?

We are using Apache Flume as a router to consume data from Kafka and push
to HDFS.
With Flume 1.6, Kafka Channel, Source and Sink are available out of the box.

Here is the blog post from Cloudera
http://blog.cloudera.com/blog/2014/11/flafka-apache-flume-meets-apache-kafka-for-event-processing/

Thanks,

Vivek Thakre

On Thu, Oct 22, 2015 at 2:29 PM, Hawin Jiang <ha...@gmail.com> wrote:

> Very useful information for us.
> Thanks Guozhang.

Re: future of Camus?

Posted by vivek thakre <vi...@gmail.com>.

We are using Apache Flume as a router to consume data from Kafka and push
to HDFS.
With Flume 1.6, Kafka Channel, Source and Sink are available out of the box.

Here is the blog post from Cloudera
http://blog.cloudera.com/blog/2014/11/flafka-apache-flume-meets-apache-kafka-for-event-processing/

Thanks,

Vivek Thakre




Re: future of Camus?

Posted by Hawin Jiang <ha...@gmail.com>.

Very useful information for us.
Thanks Guozhang.

Re: future of Camus?

Posted by Guozhang Wang <wa...@gmail.com>.

Hi Adrian,

Another alternative approach is to use Kafka's own Copycat framework for
data ingress / egress. It will be released in our 0.9.0 version,
expected in Nov.

Under Copycat, users can write different "connectors" instantiated for
different source / sink systems, and for your case there is a built-in
HDFS connector coming along with the framework itself. You can find more
details in these Kafka wikis / Java docs:

https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=58851767

https://s3-us-west-2.amazonaws.com/confluent-files/copycat-docs-wip/intro.html
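
As a rough illustration of the source / sink split described above, here is a minimal Python sketch. It is not the Copycat API (which is Java); every name in it (`Source`, `Sink`, `ListSource`, `MemorySink`, `copy_all`) is hypothetical, and it leaves out Kafka itself, which in the real framework sits between the two sides.

```python
from typing import Iterable, List, Protocol


class Source(Protocol):
    """Anything the framework can pull batches of records from."""
    def poll(self) -> List[str]: ...


class Sink(Protocol):
    """Anything the framework can push batches of records into."""
    def put(self, records: Iterable[str]) -> None: ...


class ListSource:
    """Stand-in for an external system being read; hands out fixed-size batches."""
    def __init__(self, rows: Iterable[str], batch_size: int = 2) -> None:
        self._rows = list(rows)
        self._batch_size = batch_size
        self._offset = 0  # position of the next unread row

    def poll(self) -> List[str]:
        batch = self._rows[self._offset:self._offset + self._batch_size]
        self._offset += len(batch)
        return batch


class MemorySink:
    """Stand-in for an external system being written; keeps records in memory."""
    def __init__(self) -> None:
        self.written: List[str] = []

    def put(self, records: Iterable[str]) -> None:
        self.written.extend(records)


def copy_all(source: Source, sink: Sink) -> int:
    """Framework loop: poll the source until it is empty; return the record count."""
    copied = 0
    while True:
        batch = source.poll()
        if not batch:
            return copied
        sink.put(batch)
        copied += len(batch)


sink = MemorySink()
count = copy_all(ListSource(["a", "b", "c"]), sink)
```

The point of the split is that the copy loop never needs to know which systems are on either end; a connector author only supplies the `poll` or `put` side.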

Guozhang



Re: future of Camus?

Posted by Henry Cai <hc...@pinterest.com.INVALID>.

Take a look at secor:

https://github.com/pinterest/secor

Secor is a no-frills Kafka->HDFS ingestion tool. It doesn't depend on any
underlying systems such as Hadoop; it only uses the Kafka high-level consumer
to balance the workload.  Very easy to understand and manage.  It's
probably the 2nd most popular Kafka/HDFS ingestion tool (behind Camus).
Lots of web companies use it for Kafka data ingestion
(Pinterest/Uber/Airbnb).
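
To make "balance the workload" concrete, the sketch below spreads topic partitions across the members of a consumer group. It is a simplified round-robin illustration in Python, not the assignment algorithm the Kafka high-level consumer actually uses, and the function name and the `secor-N` process names are made up for the example.

```python
def assign_partitions(partitions, consumers):
    """Spread partitions over consumers round-robin, in sorted order."""
    ordered = sorted(consumers)
    if not ordered:
        raise ValueError("at least one consumer is required")
    assignment = {consumer: [] for consumer in ordered}
    for index, partition in enumerate(sorted(partitions)):
        assignment[ordered[index % len(ordered)]].append(partition)
    return assignment


# Six partitions shared by two processes, then rebalanced after a third joins.
before = assign_partitions(range(6), ["secor-1", "secor-2"])
after = assign_partitions(range(6), ["secor-1", "secor-2", "secor-3"])
```

Adding a process shrinks every member's share without any external scheduler, which is why a tool built this way can scale without depending on Hadoop.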
