Posted to users@kafka.apache.org by Jun Rao <ju...@gmail.com> on 2014/05/17 01:04:02 UTC
Re: NFS and/or local filesystem consumer?
You probably would have to write a consumer app to dump data in binary form
to GPFS or NFS, since the HDFS api is very special.
Thanks,
Jun
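Jun's suggestion above (a consumer app that dumps each message's raw bytes to a GPFS- or NFS-mounted directory) could be sketched roughly as below. Everything here is an illustrative assumption, not something from the thread: the function name, topic name, broker address, and mount point are placeholders, and the kafka-python client shown in the usage comment postdates this 2014 discussion.

```python
# Sketch: append each Kafka message's raw bytes to a file on a mounted
# filesystem (GPFS or NFS), one file per topic partition so that appends
# within a file preserve offset order. All names are hypothetical.
import os


def dump_messages(consumer, out_dir):
    """Append each message's payload bytes to <out_dir>/<topic>-<partition>.bin.

    `consumer` can be any iterable yielding objects with .topic,
    .partition, and .value (bytes) attributes, e.g. a kafka-python
    KafkaConsumer.
    """
    os.makedirs(out_dir, exist_ok=True)
    for msg in consumer:
        path = os.path.join(out_dir, f"{msg.topic}-{msg.partition}.bin")
        with open(path, "ab") as f:
            f.write(msg.value)  # raw binary payload, written as-is


# Hypothetical usage with the kafka-python client (pip install kafka-python):
#   from kafka import KafkaConsumer
#   dump_messages(KafkaConsumer("microscope-frames",
#                               bootstrap_servers="broker:9092"),
#                 out_dir="/mnt/gpfs/kafka-dump")
```

Because the function only needs an iterable of message-like objects, the file-writing logic can be exercised without a running broker.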
On Fri, May 16, 2014 at 8:17 AM, Carlile, Ken <ca...@janelia.hhmi.org> wrote:
> Hi all,
>
> Sorry for the possible repost--hadn't seen this in the list after 18 hours
> and figured I'd try again....
>
> We are experimenting with using Kafka as a midpoint between microscopes and
> a Spark cluster for data analysis. Our microscopes almost universally use
> Windows machines for acquisition (as do most scientific instruments), and
> our compute cluster (which runs Spark among many other things) runs Linux.
> We use Isilon for file storage primarily, although we also have a GPFS
> cluster for HPC.
>
> We have a working HTTP POST system going into Kafka from the Windows
> acquisition machine, which is proving faster and more reliable than an
> SMB connection to the Isilon or GPFS clusters. Unfortunately, the Spark
> streaming consumer is much slower than reading from disk (Isilon or GPFS)
> on the Spark cluster.
>
> My proposal would be to not only improve the Spark streaming, but also to
> have a consumer (or multiple consumers!) that writes to disk, either over
> NFS or "locally" via a GPFS client.
>
> As I am a systems engineer, I'm not equipped to write this, so I'm
> wondering if anyone has done this sort of thing with Kafka before. I know
> there are HDFS consumers out there, and our Isilons can do HDFS, but the
> implementation on the Isilon is very limited at this time, and the ability
> to write to local filesystem or NFS would give us much more flexibility.
>
> Ideally, I would like to be able to use Kafka as a high speed transfer
> point between acquisition instruments (usually running Windows) and several
> kinds of storage, so that we could write virtually simultaneously to
> archive storage for the raw data and to HPC scratch for data analysis,
> thereby limiting the penalty incurred from data movement between storage
> tiers.
>
> Thanks for any input you have,
>
> --Ken
Re: NFS and/or local filesystem consumer?
Posted by Jun Rao <ju...@gmail.com>.
Ken,
We don't have anything like that now. It shouldn't be too hard to write one,
though. You would probably need some kind of time-based partitioning to
different files.
Thanks,
Jun
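The time-based file partitioning Jun suggests could look something like the sketch below: route each message to a dated, hourly file based on its timestamp. The function names, the hourly granularity, and the `(topic, timestamp, payload)` tuple shape are all assumptions for illustration, not anything specified in the thread.

```python
# Sketch: time-based partitioning of consumed messages into hourly files,
# e.g. <root>/<topic>/2014-05-16/15.bin. Hourly buckets are an assumed
# granularity; daily or per-minute files would follow the same pattern.
import os
import time


def partition_path(root, topic, ts_seconds):
    """Return the hourly bucket path for a message timestamp (epoch seconds)."""
    t = time.gmtime(ts_seconds)
    day = time.strftime("%Y-%m-%d", t)
    return os.path.join(root, topic, day, f"{t.tm_hour:02d}.bin")


def write_partitioned(messages, root):
    """Append each (topic, ts_seconds, payload) tuple to its hourly file."""
    for topic, ts, payload in messages:
        path = partition_path(root, topic, ts)
        os.makedirs(os.path.dirname(path), exist_ok=True)
        with open(path, "ab") as f:
            f.write(payload)
```

Keeping the path computation in its own function makes it easy to swap in a different rollover policy (per day, per size threshold) without touching the consume loop.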
On Fri, May 16, 2014 at 4:40 PM, Carlile, Ken <ca...@janelia.hhmi.org> wrote:
> Hi Jun,
>
> I was wondering if there was something out there already. GPFS appears to
> the OS as a local filesystem, so if there was a consumer that dumped to
> local filesystem, we'd be gold.
>
> Thanks,
> --Ken
Re: NFS and/or local filesystem consumer?
Posted by "Carlile, Ken" <ca...@janelia.hhmi.org>.
Hi Jun,
I was wondering if there was something out there already. GPFS appears to the OS as a local filesystem, so if there was a consumer that dumped to local filesystem, we'd be gold.
Thanks,
--Ken