Posted to user@spark.apache.org by Mohit Anchlia <mo...@gmail.com> on 2015/08/15 01:50:50 UTC

Too many files/dirs in hdfs

Spark Streaming seems to be creating 0-byte files even when there is no
data. Also, I have two concerns here:

1) Extra, unnecessary files are being created in the output
2) Hadoop doesn't work well with too many small files, and I see that it is
creating a directory with a timestamp every 1 second. Is there a better way
of writing files, maybe using some kind of append mechanism, so that one
doesn't have to change the batch interval?
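
For reference, a minimal sketch of the kind of job that produces this
layout (the socket source and paths are illustrative, not my actual job):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setAppName("StreamToHdfs")
val ssc = new StreamingContext(conf, Seconds(1)) // 1-second batch interval

// Illustrative source; any DStream shows the same behavior.
val lines = ssc.socketTextStream("localhost", 9999)

// Every batch writes a new directory named <prefix>-<batch time in ms>,
// even when the batch is empty -- hence the 0-byte part files.
lines.saveAsTextFiles("hdfs:///user/output/events")

ssc.start()
ssc.awaitTermination()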

Re: Too many files/dirs in hdfs

Posted by Mohit Anchlia <mo...@gmail.com>.
Based on what I've read, it appears that when using Spark Streaming there is
no good way of optimizing the files on HDFS. Spark Streaming writes many
small files, which is not scalable in Apache Hadoop. The only other option
seems to be to read the files after they have been written and merge them
into a bigger file, which looks like extra overhead from a maintenance and
I/O perspective.

Re: Too many files/dirs in hdfs

Posted by Mohit Anchlia <mo...@gmail.com>.
Any help would be appreciated

Re: Too many files/dirs in hdfs

Posted by Mohit Anchlia <mo...@gmail.com>.
My question was how to do this in Hadoop. Could somebody point me to some
examples?
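
For what it's worth, the lowest-level primitive I have found on the HDFS
side is FileSystem.append. A rough sketch, assuming the cluster has append
support enabled and with an illustrative path:

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

val conf = new Configuration()
val fs = FileSystem.get(conf)
val path = new Path("hdfs:///user/output/events.txt") // illustrative

// Create the file on the first write, append on later writes.
val out = if (fs.exists(path)) fs.append(path) else fs.create(path)
try {
  out.write("one record\n".getBytes("UTF-8"))
} finally {
  out.close()
}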

Re: Too many files/dirs in hdfs

Posted by UMESH CHAUDHARY <um...@gmail.com>.
Of course, Java or Scala can do that (a sketch follows after this list):
1) Create a FileWriter with an append or roll-over option
2) For each RDD, build a StringBuilder after applying your filters
3) Write this StringBuilder to the file whenever you choose (the duration
can be expressed as a condition)
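
A minimal sketch of those three steps, assuming the filtered records are
small enough to collect to the driver (the DStream, the filter, and the
path are illustrative):

import java.io.FileWriter

// Assumes dstream: DStream[String] from your streaming job.
dstream.foreachRDD { rdd =>
  // Step 2: apply your filters, then gather the batch on the driver.
  val records = rdd.filter(_.nonEmpty).collect()
  if (records.nonEmpty) {
    val sb = new StringBuilder
    records.foreach(r => sb.append(r).append('\n'))
    // Steps 1 and 3: open in append mode; roll the file over on your own
    // condition (size, elapsed time) instead of the batch interval.
    val writer = new FileWriter("/tmp/stream-output.txt", true) // append
    try writer.write(sb.toString) finally writer.close()
  }
}

Note that java.io.FileWriter targets the local filesystem; doing the same
against HDFS needs the Hadoop FileSystem API instead.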

Re: Too many files/dirs in hdfs

Posted by Mohit Anchlia <mo...@gmail.com>.
Is there a way to store all the results in one file and keep the file roll
over separate from the Spark Streaming batch interval?

Re: Too many files/dirs in hdfs

Posted by UMESH CHAUDHARY <um...@gmail.com>.
In Spark Streaming you can simply check whether your RDD contains any
records, and save it only when it does (using a FileOutputStream, for
instance):

dstream.foreachRDD { rdd =>
  val count = rdd.count()
  if (count > 0) {
    // save your stuff, e.g. rdd.saveAsTextFile(...)
  }
}

This will not create unnecessary 0-byte files.
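
As a side note, on Spark 1.3 or later rdd.isEmpty() is a cheaper guard than
a full count(), since it stops as soon as it finds a single element. A
sketch with an illustrative output path:

dstream.foreachRDD { rdd =>
  if (!rdd.isEmpty()) {
    // Write only non-empty batches; the path is illustrative.
    rdd.saveAsTextFile(s"hdfs:///user/output/batch-${System.currentTimeMillis}")
  }
}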

Re: Too many files/dirs in hdfs

Posted by Akhil Das <ak...@sigmoidanalytics.com>.
Currently, Spark Streaming creates a new directory for every batch and
stores the data in it (whether the batch has anything or not). There is no
direct append call as of now, but you can achieve this either with
FileUtil.copyMerge
<http://apache-spark-user-list.1001560.n3.nabble.com/save-spark-streaming-output-to-single-file-on-hdfs-td21124.html#a21167>
or with a separate program which does the clean-up for you.
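
A rough sketch of the copyMerge route, assuming Hadoop 2.x and illustrative
paths:

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, FileUtil, Path}

val conf = new Configuration()
val fs = FileSystem.get(conf)

// Merge all part files from one batch directory into a single file.
FileUtil.copyMerge(
  fs, new Path("hdfs:///user/output/events-1439788800000"), // source dir
  fs, new Path("hdfs:///user/output/merged/events.txt"),    // destination
  false, // deleteSource: keep the per-batch directory
  conf,
  null)  // addString: nothing inserted between merged files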

Thanks
Best Regards
