Posted to users@kafka.apache.org by "Carlile, Ken" <ca...@janelia.hhmi.org> on 2014/05/15 23:25:30 UTC

NFS and/or local filesystem consumer?

Hi all, 

We are experimenting with using Kafka as a midpoint between microscopes and a Spark cluster for data analysis. Our microscopes almost universally use Windows machines for acquisition (as do most scientific instruments), and our compute cluster (which runs Spark among many other things) runs Linux. We use Isilon for file storage primarily, although we also have a GPFS cluster for HPC. 

We have a working HTTP POST pipeline going into Kafka from the Windows acquisition machines, which is performing more reliably and faster than an SMB connection to the Isilon or GPFS clusters. Unfortunately, the Spark streaming consumer is much slower than reading from disk (Isilon or GPFS) on the Spark cluster. 

My proposal would be not only to improve the Spark streaming side, but also to have a consumer (or multiple consumers!) that writes to disk, either over NFS or "locally" via a GPFS client. 
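To make the idea concrete, here is a rough sketch of what I mean, written in Python against the kafka-python package. The topic name, group id, and paths are just placeholders, and the Kafka connection details would need a real broker; the only part I'd insist on is writing to a temp file and renaming, so readers on the NFS/GPFS side never see half-written files:

```python
# Sketch only: a consumer that lands each Kafka message as a file on an
# NFS mount or GPFS filesystem. Names below are placeholders, not a spec.
import os


def write_record(base_dir, key, value):
    """Write one message payload to base_dir/key via a temp file + rename."""
    path = os.path.join(base_dir, key)
    tmp = path + ".part"
    with open(tmp, "wb") as f:
        f.write(value)
    os.rename(tmp, path)  # atomic on POSIX, so no partially written files
    return path


def consume_to_disk(base_dir, topic="microscope-frames",
                    bootstrap="localhost:9092"):
    # Requires the kafka-python package and a reachable broker.
    from kafka import KafkaConsumer
    consumer = KafkaConsumer(topic, bootstrap_servers=bootstrap,
                             group_id="disk-writer")
    for msg in consumer:
        # Fall back to partition/offset when the producer set no key.
        name = msg.key.decode() if msg.key else "%d-%d" % (msg.partition,
                                                           msg.offset)
        write_record(base_dir, name, msg.value)
```

Point `base_dir` at an NFS mount or a GPFS path and the same code covers both cases, which is the flexibility I'm after.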

As I am a systems engineer, I'm not equipped to write this, so I'm wondering if anyone has done this sort of thing with Kafka before. I know there are HDFS consumers out there, and our Isilons can do HDFS, but the implementation on the Isilon is very limited at this time, and the ability to write to local filesystem or NFS would give us much more flexibility. 

Ideally, I would like to be able to use Kafka as a high speed transfer point between acquisition instruments (usually running Windows) and several kinds of storage, so that we could write virtually simultaneously to archive storage for the raw data and to HPC scratch for data analysis, thereby limiting the penalty incurred from data movement between storage tiers. 
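As I understand it, Kafka's consumer groups already give us this fan-out: each group tracks its own offset, so an archive writer and a scratch writer would each see the full stream independently. A toy model (plain Python, not real Kafka) of that behavior:

```python
# Toy model of Kafka fan-out: each consumer group keeps its own offset into
# the same log, so every group independently sees every message.
class TopicLog:
    def __init__(self):
        self.records = []
        self.offsets = {}  # group_id -> next offset that group will read

    def produce(self, record):
        self.records.append(record)

    def poll(self, group_id):
        """Return this group's unread records and advance its offset."""
        start = self.offsets.get(group_id, 0)
        batch = self.records[start:]
        self.offsets[group_id] = len(self.records)
        return batch


log = TopicLog()
log.produce(b"frame-0001")
log.produce(b"frame-0002")
archive_batch = log.poll("archive-writer")   # full stream for archive tier
scratch_batch = log.poll("scratch-writer")   # same data, independent offset
```

If that picture is right, the dual write to archive and HPC scratch is just two disk-writing consumers with different group ids.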

Thanks for any input you have,

--Ken