Posted to user@spark.apache.org by Taeyun Kim <ta...@innowireless.com> on 2015/03/31 02:41:24 UTC

Task size is large when CombineTextInputFormat is used

Hi,

 

I used CombineTextInputFormat to read many small files.

The Java code is as follows (I've written it as a utility function):

 

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.lib.input.CombineTextInputFormat;
    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.JavaSparkContext;
    import org.apache.spark.api.java.function.Function;
    import scala.Tuple2;

    public static JavaRDD<String> combineTextFile(JavaSparkContext sc,
            String path, long maxSplitSize, boolean recursive)
    {
        Configuration conf = new Configuration();
        // Pack many small files into combined splits of at most maxSplitSize bytes.
        conf.setLong(CombineTextInputFormat.SPLIT_MAXSIZE, maxSplitSize);
        if (recursive)
            conf.setBoolean(CombineTextInputFormat.INPUT_DIR_RECURSIVE, true);
        return
            sc.newAPIHadoopFile(path, CombineTextInputFormat.class,
                    LongWritable.class, Text.class, conf)
            // Keep only the line text; the LongWritable key is the byte offset.
            .map(new Function<Tuple2<LongWritable, Text>, String>()
            {
                @Override
                public String call(Tuple2<LongWritable, Text> tuple) throws Exception
                {
                    return tuple._2().toString();
                }
            });
    }
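
For reference, this is roughly how I call the helper (the HDFS path and the
64 MB max split size below are just illustrative values, not the real ones):

    SparkConf sparkConf = new SparkConf().setAppName("CombineSmallFiles");
    JavaSparkContext sc = new JavaSparkContext(sparkConf);

    // Combine the small files into splits of at most ~64 MB each.
    JavaRDD<String> lines =
        combineTextFile(sc, "hdfs:///data/small-files", 64L * 1024 * 1024, false);
    System.out.println("line count: " + lines.count());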

 

It works, but when the program runs, the following warning is printed to the
console:

 

WARN TaskSetManager: Stage 0 contains a task of very large size (159 KB).
The maximum recommended task size is 100 KB.

 

The program reads about 3.5 MB in total from 1234 files, all located in one
directory.

 

Is this normal?

 

My Spark version is 1.3.

 

Thanks.