Posted to dev@storm.apache.org by "Sanket Reddy (JIRA)" <ji...@apache.org> on 2015/09/01 15:36:45 UTC

[jira] [Commented] (STORM-828) HdfsBolt takes a lot of configuration, need good defaults

    [ https://issues.apache.org/jira/browse/STORM-828?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14725399#comment-14725399 ] 

Sanket Reddy commented on STORM-828:
------------------------------------

I could make this more specific by defaulting the delimited record format to "," for CSV and setting a corresponding default for TSV. But you are right: it may depend on whether we want to bake in those specifics or leave the configuration to the user. It would be helpful if you could give your opinion on how I should proceed.
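For illustration, the delimiter defaults discussed above could be expressed as preconfigured factory methods. This is a minimal, self-contained sketch: `DelimitedFormat` here is a stand-in for the real `org.apache.storm.hdfs.bolt.format.DelimitedRecordFormat`, and the `csv()`/`tsv()` factories are hypothetical, not part of the storm-hdfs API.

```java
import java.util.List;

// Illustrative stand-in for storm-hdfs's DelimitedRecordFormat; the
// csv()/tsv() factories sketch the proposed defaults and are hypothetical.
class DelimitedFormat {
    private final String fieldDelimiter;
    private final String recordDelimiter = "\n";

    DelimitedFormat(String fieldDelimiter) {
        this.fieldDelimiter = fieldDelimiter;
    }

    // Proposed defaults: "," for CSV, tab for TSV.
    static DelimitedFormat csv() { return new DelimitedFormat(","); }
    static DelimitedFormat tsv() { return new DelimitedFormat("\t"); }

    // Join the fields with the field delimiter and terminate the record.
    String format(List<?> fields) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < fields.size(); i++) {
            if (i > 0) sb.append(fieldDelimiter);
            sb.append(fields.get(i));
        }
        return sb.append(recordDelimiter).toString();
    }
}

public class Main {
    public static void main(String[] args) {
        // a,b,c followed by a newline
        System.out.print(DelimitedFormat.csv().format(List.of("a", "b", "c")));
    }
}
```

With factories like these, a user who wants CSV or TSV output would not need to touch the field delimiter at all, while the existing builder path would remain available for custom delimiters.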

> HdfsBolt takes a lot of configuration, need good defaults
> ---------------------------------------------------------
>
>                 Key: STORM-828
>                 URL: https://issues.apache.org/jira/browse/STORM-828
>             Project: Apache Storm
>          Issue Type: Improvement
>            Reporter: Robert Joseph Evans
>            Assignee: Sanket Reddy
>
> The following is code from https://github.com/apache/storm/blob/master/external/storm-hdfs/src/test/java/org/apache/storm/hdfs/bolt/HdfsFileTopology.java representing the amount of configuration required to use the HdfsBolt. 
> {code}
>         // sync the filesystem after every 1k tuples
>         SyncPolicy syncPolicy = new CountSyncPolicy(1000);
>         // rotate files every 1 min
>         FileRotationPolicy rotationPolicy = new TimedRotationPolicy(1.0f, TimedRotationPolicy.TimeUnit.MINUTES);
>         FileNameFormat fileNameFormat = new DefaultFileNameFormat()
>                 .withPath("/tmp/foo/")
>                 .withExtension(".txt");
>         RecordFormat format = new DelimitedRecordFormat()
>                 .withFieldDelimiter("|");
>         Yaml yaml = new Yaml();
>         InputStream in = new FileInputStream(args[1]);
>         Map<String, Object> yamlConf = (Map<String, Object>) yaml.load(in);
>         in.close();
>         config.put("hdfs.config", yamlConf);
>         HdfsBolt bolt = new HdfsBolt()
>                 .withConfigKey("hdfs.config")
>                 .withFsUrl(args[0])
>                 .withFileNameFormat(fileNameFormat)
>                 .withRecordFormat(format)
>                 .withRotationPolicy(rotationPolicy)
>                 .withSyncPolicy(syncPolicy)
>                 .addRotationAction(new MoveFileAction().toDestination("/tmp/dest2/"));
> {code}
> This is way too much.  If it were just an example showing all of the possibilities, that would be OK, but of the 8 lines used in constructing the bolt, 5 are required or the bolt will blow up at run time.  We should provide reasonable defaults for everything that can have one.  Required parameters should be passed in through the constructor, not as builder arguments.  I realize we need to maintain backwards compatibility, so we may need some new bolt definitions.
> {code}
> HdfsTSVBolt bolt = new HdfsTSVBolt(outputDir);
> {code}
> If someone wanted to sync every 100 records instead of every 1000 we could do
> {code}
> TSVFileBolt bolt = new TSVFileBolt(outputDir).withSyncPolicy(new CountSyncPolicy(100));
> {code}
> I would like to see a base HdfsFileBolt that requires a record format, and an output directory.  It would have defaults for everything else.  Then we could have a TSVFileBolt and CSVFileBolt subclass it and ideally SequenceFileBolt as well.
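The hierarchy proposed in the issue could be mocked up roughly as follows. This is an illustrative sketch only: `SimpleRecordFormat` and the fields on `HdfsFileBolt` stand in for the real storm-hdfs `RecordFormat`, `SyncPolicy`, and `FileRotationPolicy` types, and every name beyond those the issue itself suggests is hypothetical.

```java
// Stand-in for storm-hdfs's RecordFormat, kept minimal for illustration.
interface SimpleRecordFormat {
    String format(String[] fields);
}

// Sketch of the proposed base bolt: required parameters go through the
// constructor, everything else carries a reasonable default.
class HdfsFileBolt {
    final String outputDir;                 // required
    final SimpleRecordFormat recordFormat;  // required
    int syncEveryTuples = 1000;             // default: sync every 1k tuples
    float rotateEveryMinutes = 1.0f;        // default: rotate every minute

    HdfsFileBolt(String outputDir, SimpleRecordFormat recordFormat) {
        this.outputDir = outputDir;
        this.recordFormat = recordFormat;
    }

    // Builder-style overrides remain available for the non-required knobs.
    HdfsFileBolt withSyncEvery(int tuples) {
        this.syncEveryTuples = tuples;
        return this;
    }
}

// The subclasses the issue suggests: each fixes the record format.
class TSVFileBolt extends HdfsFileBolt {
    TSVFileBolt(String outputDir) { super(outputDir, f -> String.join("\t", f)); }
}

class CSVFileBolt extends HdfsFileBolt {
    CSVFileBolt(String outputDir) { super(outputDir, f -> String.join(",", f)); }
}

public class Demo {
    public static void main(String[] args) {
        // One required argument; overriding a default stays a one-liner.
        HdfsFileBolt bolt = new TSVFileBolt("/tmp/dest2/").withSyncEvery(100);
        System.out.println(bolt.recordFormat.format(new String[]{"a", "b"}));
    }
}
```

Under this shape, the one-line `HdfsTSVBolt bolt = new HdfsTSVBolt(outputDir);` usage from the issue becomes possible, and a `SequenceFileBolt` could subclass the same base with a different format.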



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)