Posted to user@spark.apache.org by Ra...@DellTeam.com on 2016/07/20 04:18:40 UTC

Storm HDFS bolt equivalent in Spark Streaming.

While writing from Storm to HDFS, the HDFS bolt provides a nice way to batch the messages, rotate files, control the file-naming convention, etc., as shown below.

Do you know of something similar in Spark Streaming, or do we have to roll our own? If anyone has attempted this, can you share some pointers?

Every other streaming solution, like Flume and NiFi, handles this kind of logic.

https://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.3.6/bk_storm-user-guide/content/writing-data-with-storm-hdfs-connector.html

// use "|" instead of "," for field delimiter
RecordFormat format = new DelimitedRecordFormat()
        .withFieldDelimiter("|");

// Synchronize the filesystem after every 1000 tuples
SyncPolicy syncPolicy = new CountSyncPolicy(1000);

// Rotate data files when they reach 5 MB
FileRotationPolicy rotationPolicy = new FileSizeRotationPolicy(5.0f, Units.MB);

// Use default, Storm-generated file names
FileNameFormat fileNameFormat = new DefaultFileNameFormat()
        .withPath("/foo/");


// Instantiate the HdfsBolt
HdfsBolt bolt = new HdfsBolt()
        .withFsUrl("hdfs://localhost:8020")
        .withFileNameFormat(fileNameFormat)
        .withRecordFormat(format)
        .withRotationPolicy(rotationPolicy)
        .withSyncPolicy(syncPolicy);



Re: Storm HDFS bolt equivalent in Spark Streaming.

Posted by Rabin Banerjee <de...@gmail.com>.
++Deepak,

There is also an option to use saveAsHadoopFile & saveAsNewAPIHadoopFile, in
which you can customize the way you want to save it (file names and many other
things). :)
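
A minimal sketch of that approach (the helper name, HDFS path, and key/value
classes below are illustrative assumptions, not from this thread): pair up each
record, then hand saveAsNewAPIHadoopFile an output format and a per-batch path
of your choosing.

import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import scala.Tuple2;

public class SaveWithHadoopFile {

    // Hypothetical helper: call it from foreachRDD in your streaming job to
    // write one output directory per micro-batch, named by the batch time.
    public static void save(JavaRDD<String> batch, long batchTimeMs) {
        // The saveAs*HadoopFile family lives on pair RDDs, so pair up first.
        JavaPairRDD<NullWritable, Text> pairs = batch.mapToPair(
                line -> new Tuple2<>(NullWritable.get(), new Text(line)));

        pairs.saveAsNewAPIHadoopFile(
                "hdfs://localhost:8020/foo/batch-" + batchTimeMs, // output dir per batch
                NullWritable.class,                               // key class
                Text.class,                                       // value class
                TextOutputFormat.class);                          // swap in your own OutputFormat here
    }
}

A custom OutputFormat plugged into that last argument is where file-naming
logic comparable to the HdfsBolt policies would go.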

Happy Sparking !!!!

Regards,
Rabin Banerjee

On Wed, Jul 20, 2016 at 10:01 AM, Deepak Sharma <de...@gmail.com>
wrote:

> In Spark Streaming, you have to decide the duration of the micro-batches to
> run.
> Once you get a micro-batch, transform it as per your logic, and then you can
> use saveAsTextFiles on your final DStream to write it to HDFS.
>
> Thanks
> Deepak

Re: Storm HDFS bolt equivalent in Spark Streaming.

Posted by Deepak Sharma <de...@gmail.com>.
In Spark Streaming, you have to decide the duration of the micro-batches to
run.
Once you get a micro-batch, transform it as per your logic, and then you can
use saveAsTextFiles on your final DStream to write it to HDFS.
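
A minimal sketch of that flow (the socket source, batch interval, and output
path below are illustrative assumptions, not from this thread):

import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;

public class StreamToHdfs {
    public static void main(String[] args) throws InterruptedException {
        SparkConf conf = new SparkConf().setAppName("StreamToHdfs");

        // Decide the micro-batch duration up front, e.g. 30 seconds.
        JavaStreamingContext ssc =
                new JavaStreamingContext(conf, Durations.seconds(30));

        // Illustrative source; in practice this would be your Kafka stream.
        JavaDStream<String> lines = ssc.socketTextStream("localhost", 9999);

        // Transform each micro-batch as per your logic, e.g. re-delimit fields.
        JavaDStream<String> transformed = lines.map(line -> line.replace(",", "|"));

        // saveAsTextFiles writes one HDFS directory per micro-batch:
        // /foo/out-<batchTimeMs>.txt
        transformed.dstream().saveAsTextFiles("hdfs://localhost:8020/foo/out", "txt");

        ssc.start();
        ssc.awaitTermination();
    }
}

Note there is no built-in size-based rotation here; each micro-batch simply
becomes its own set of files, so the batch interval effectively plays the role
of the rotation policy.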

Thanks
Deepak
