Posted to users@kafka.apache.org by Jun Rao <ju...@gmail.com> on 2014/05/17 01:04:02 UTC

Re: NFS and/or local filesystem consumer?

You would probably have to write a consumer app to dump the data in binary
form to GPFS or NFS, since the HDFS API is quite specialized.
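
For illustration, here is a minimal sketch of such a consumer in Java, using
the newer KafkaConsumer client (which postdates this thread); the broker
address, group id, topic name, and output path are placeholder assumptions:

    import java.nio.file.*;
    import java.time.Duration;
    import java.util.Collections;
    import java.util.Properties;
    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.clients.consumer.ConsumerRecords;
    import org.apache.kafka.clients.consumer.KafkaConsumer;

    public class FileDumpConsumer {
        public static void main(String[] args) throws Exception {
            Properties props = new Properties();
            props.put("bootstrap.servers", "broker1:9092");   // placeholder broker
            props.put("group.id", "gpfs-writer");             // placeholder group id
            props.put("key.deserializer",
                    "org.apache.kafka.common.serialization.ByteArrayDeserializer");
            props.put("value.deserializer",
                    "org.apache.kafka.common.serialization.ByteArrayDeserializer");

            try (KafkaConsumer<byte[], byte[]> consumer = new KafkaConsumer<>(props)) {
                consumer.subscribe(Collections.singletonList("microscope-frames")); // placeholder topic
                // GPFS looks like a local filesystem to the OS, so a plain file path works.
                Path out = Paths.get("/mnt/gpfs/kafka-dump/microscope-frames.bin"); // placeholder path
                Files.createDirectories(out.getParent());
                while (true) {
                    ConsumerRecords<byte[], byte[]> records = consumer.poll(Duration.ofSeconds(1));
                    for (ConsumerRecord<byte[], byte[]> rec : records) {
                        // Append each message's raw bytes, preserving the binary payload.
                        Files.write(out, rec.value(),
                                StandardOpenOption.CREATE, StandardOpenOption.APPEND);
                    }
                }
            }
        }
    }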

Thanks,

Jun


On Fri, May 16, 2014 at 8:17 AM, Carlile, Ken <ca...@janelia.hhmi.org> wrote:

> Hi all,
>
> Sorry for the possible repost--I hadn't seen this on the list after 18 hours
> and figured I'd try again....
>
> We are experimenting with using Kafka as a midpoint between microscopes and
> a Spark cluster for data analysis. Our microscopes almost universally use
> Windows machines for acquisition (as do most scientific instruments), and
> our compute cluster (which runs Spark among many other things) runs Linux.
> We use Isilon for file storage primarily, although we also have a GPFS
> cluster for HPC.
>
> We have a working HTTP POST system going into Kafka from the Windows
> acquisition machine, which is performing faster and more reliably than an
> SMB connection to the Isilon or GPFS clusters. Unfortunately, the Spark
> Streaming consumer is much slower than reading from disk (Isilon or GPFS)
> on the Spark cluster.
>
> My proposal would be not only to improve the Spark Streaming consumer, but
> also to have a consumer (or multiple consumers!) that writes to disk, either
> over NFS or "locally" via a GPFS client.
>
> As I am a systems engineer, I'm not equipped to write this, so I'm
> wondering if anyone has done this sort of thing with Kafka before. I know
> there are HDFS consumers out there, and our Isilons can do HDFS, but the
> implementation on the Isilon is very limited at this time, and the ability
> to write to local filesystem or NFS would give us much more flexibility.
>
> Ideally, I would like to be able to use Kafka as a high-speed transfer
> point between acquisition instruments (usually running Windows) and several
> kinds of storage, so that we could write virtually simultaneously to
> archive storage for the raw data and to HPC scratch for data analysis,
> thereby limiting the penalty incurred from data movement between storage
> tiers.
>
> Thanks for any input you have,
>
> --Ken

Re: NFS and/or local filesystem consumer?

Posted by Jun Rao <ju...@gmail.com>.
Ken,

We don't have something like that now. It shouldn't be too hard to write one,
though. You would probably need some kind of time-based partitioning into
different files.
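
As a sketch of that time-based partitioning (the directory layout and hourly
granularity are assumptions, and record timestamps require a newer Kafka
release than this thread's), the write step in the earlier consumer sketch
could bucket each record into a file named for its hour:

    // Inside the poll loop: roll output files hourly based on the record timestamp.
    java.text.SimpleDateFormat hourFmt = new java.text.SimpleDateFormat("yyyy-MM-dd-HH");
    String bucket = hourFmt.format(new java.util.Date(rec.timestamp()));
    Path out = Paths.get("/mnt/gpfs/kafka-dump", rec.topic(), bucket + ".bin"); // placeholder root
    Files.createDirectories(out.getParent());
    Files.write(out, rec.value(),
            StandardOpenOption.CREATE, StandardOpenOption.APPEND);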

Thanks,

Jun


On Fri, May 16, 2014 at 4:40 PM, Carlile, Ken <ca...@janelia.hhmi.org> wrote:

> Hi Jun,
>
> I was wondering if there was something out there already. GPFS appears to
> the OS as a local filesystem, so if there were a consumer that dumped to a
> local filesystem, we'd be golden.
>
> Thanks,
> --Ken

Re: NFS and/or local filesystem consumer?

Posted by "Carlile, Ken" <ca...@janelia.hhmi.org>.
Hi Jun, 

I was wondering if there was something out there already. GPFS appears to the OS as a local filesystem, so if there were a consumer that dumped to a local filesystem, we'd be golden.

Thanks,
--Ken

On May 16, 2014, at 7:04 PM, Jun Rao <ju...@gmail.com> wrote:

> You would probably have to write a consumer app to dump the data in binary
> form to GPFS or NFS, since the HDFS API is quite specialized.