Posted to common-user@hadoop.apache.org by Marko Dinic <ma...@nissatech.com> on 2015/04/24 10:53:29 UTC

Large number of small files

Hello,

I'm not sure if this is the place to ask this question, but I'm still
hoping for an answer/advice.

A large number of small files are uploaded, each about 8 KB. I am aware
that this is not something you hope for when working with Hadoop.

I was thinking about using HAR files and combined input, or sequence
files. The problem is that the files are timestamped and I need a
different subset at different times - for example, one job needs to run
on files uploaded during the last 3 months, while the next job might
consider the last 6 months. Naturally, as time passes, a different
subset of files is needed.

This means that I would need to make a sequence file (or a HAR) each
time I run a job, to get a smaller number of mappers. On the other hand,
I need to keep the original files so I can subset them. This means that
the NameNode is under constant pressure, keeping all of their metadata
in its memory.
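
For illustration, a minimal sketch of that packing step; the paths and
the choice of the file name as key are only assumptions:

// Minimal sketch (invented paths, file name used as key): copy each small
// file into a single SequenceFile so one job input replaces thousands of
// tiny ones.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class PackSmallFiles {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    try (SequenceFile.Writer writer = SequenceFile.createWriter(conf,
            SequenceFile.Writer.file(new Path("/packed/last-3-months.seq")),
            SequenceFile.Writer.keyClass(Text.class),
            SequenceFile.Writer.valueClass(BytesWritable.class))) {
      for (FileStatus st : fs.listStatus(new Path("/data/measurements"))) {
        if (!st.isFile()) continue;
        byte[] buf = new byte[(int) st.getLen()];
        try (FSDataInputStream in = fs.open(st.getPath())) {
          in.readFully(buf);                 // an 8 KB file fits in one read
        }
        // Key by file name; the value is the raw content of the small file.
        writer.append(new Text(st.getPath().getName()), new BytesWritable(buf));
      }
    }
  }
}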

How can I solve this problem?

I was also considering using Cassandra, or something like that, and
saving the file content inside of it, instead of saving it as files on
HDFS. The file content is actually a measurement, that is, a vector of
numbers, with some metadata.

Thanks

Re: Large number of small files

Posted by Harshit Mathur <ma...@gmail.com>.
You can use CombineFileInputFormat; it will save you from launching a large
number of mappers. As I understand your problem, one of your jobs will run on
the past 3 months of data while another might run on the past 6 months of data.
For this you can run a pre-processing job which just outputs a file holding
the paths of the files within your desired time range of 3 or 6 months. Then
add each of these paths as an input path so that all of them reach your
mappers.
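
A minimal sketch of this idea (the directories, output path, and the use of
the HDFS modification time as the file's timestamp are assumptions for
illustration); CombineTextInputFormat is the text flavour of
CombineFileInputFormat:

// Minimal sketch: pick input files by modification time and let
// CombineTextInputFormat pack many small files into each split,
// so only a few mappers are launched.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.CombineTextInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class TimeRangeCombineJob {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    long windowMs = 90L * 24 * 60 * 60 * 1000;          // "last 3 months", roughly
    long cutoff = System.currentTimeMillis() - windowMs;

    Job job = Job.getInstance(conf, "last-3-months");
    job.setInputFormatClass(CombineTextInputFormat.class);
    // Pack the small files into splits of up to ~128 MB each.
    FileInputFormat.setMaxInputSplitSize(job, 128L * 1024 * 1024);

    // The "pre-processing" step: keep only files whose timestamp is in the
    // window (here the HDFS modification time stands in for the timestamp).
    for (FileStatus st : fs.listStatus(new Path("/data/measurements"))) {
      if (st.isFile() && st.getModificationTime() >= cutoff) {
        FileInputFormat.addInputPath(job, st.getPath());
      }
    }

    FileOutputFormat.setOutputPath(job, new Path("/out/last-3-months"));
    // job.setMapperClass(...), job.setReducerClass(...), output types etc. omitted.
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

Instead of filtering on modification time in the driver, the pre-processing
job could write the selected paths to a file first, as suggested above, and
the driver would then read that file and call addInputPath for each entry.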



On Fri, Apr 24, 2015 at 3:03 PM, Chandra Mohan, Ananda Vel Murugan <
Ananda.Murugan@honeywell.com> wrote:

>  Marko,
>
>
>
> The Parquet file would be created once, when you load the data. You don't have
> to store your small files in HDFS just for the reason of subsetting the data
> by time range. You can store data and metadata in the same Parquet file. As
> already pointed out, Parquet files work well with other tools in the Hadoop
> ecosystem. Apart from the performance of your MapReduce jobs, the other
> aspect is storage efficiency. Serialization formats like Avro and Parquet
> provide better compression, and hence the data occupies less space.
>
>
>
> Regards,
>
> Anand
>
>
>
> *From:* Alexander Alten-Lorenz [mailto:wget.null@gmail.com]
> *Sent:* Friday, April 24, 2015 2:49 PM
> *To:* user@hadoop.apache.org
> *Subject:* Re: Large number of small files
>
>
>
> Marko,
>
>
>
> Cassandra is a NoSQL DB, much as HBase is for Hadoop. The pros and cons won't
> be discussed here.
>
>
>
> Parquet is a columnar storage format. It is - at a high level - a bit
> like a NoSQL DB, but at the storage level. It allows users to "query" the
> data with MR, Pig or similar tools. Additionally, Parquet works perfectly
> with Hive and Cloudera Impala, as well as Apache Drill.
>
>
>
> https://parquet.incubator.apache.org/documentation/latest/
>
>
> http://www.cloudera.com/content/cloudera/en/documentation/cloudera-impala/v2-0-x/topics/impala_parquet.html
>
>
> https://zoomdata.zendesk.com/hc/en-us/articles/200865073-Loading-My-CSV-Data-into-Impala-as-a-Parquet-Table
>
>
>
>
> --
>
> Alexander Alten-Lorenz
> m: wget.null@gmail.com
> b: mapredit.blogspot.com
>
>
>
>  On Apr 24, 2015, at 11:10 AM, Marko Dinic <ma...@nissatech.com>
> wrote:
>
>
>
> Anand,
>
> Thank you for your answer, but wouldn't that mean that I would have to
> serialize the files each time I need to run the job? And I would still need
> to save the original files, so the NameNode still needs to take care of
> them?
>
> Please correct me if I'm missing something, I'm not very experienced with
> Hadoop.
>
> What do you think about using Cassandra?
>
> Thanks
>
> On Fri 24 Apr 2015 11:03:19 AM CEST, Chandra Mohan, Ananda Vel Murugan
> wrote:
>
>  Apart from databases like Cassandra, you may check serialization formats
> like Avro or Parquet
>
> Regards,
> Anand
>
> -----Original Message-----
> From: Marko Dinic [mailto:marko.dinic@nissatech.com
> <ma...@nissatech.com>]
> Sent: Friday, April 24, 2015 2:23 PM
> To: user@hadoop.apache.org
> Subject: Large number of small files
>
> Hello,
>
> I'm not sure if this is the place to ask this question, but I'm still
> hoping for an answer/advice.
>
> A large number of small files are uploaded, each about 8 KB. I am aware that
> this is not something you hope for when working with Hadoop.
>
> I was thinking about using HAR files and combined input, or sequence
> files. The problem is that the files are timestamped and I need a different
> subset at different times - for example, one job needs to run on files
> uploaded during the last 3 months, while the next job might consider the
> last 6 months. Naturally, as time passes, a different subset of files is
> needed.
>
> This means that I would need to make a sequence file (or a HAR) each time
> I run a job, to get a smaller number of mappers. On the other hand, I need
> to keep the original files so I can subset them. This means that the
> NameNode is under constant pressure, keeping all of their metadata in its
> memory.
>
> How can I solve this problem?
>
> I was also considering using Cassandra, or something like that, and saving
> the file content inside of it, instead of saving it as files on HDFS. The
> file content is actually a measurement, that is, a vector of numbers, with
> some metadata.
>
> Thanks
>
>
>



-- 
Harshit Mathur

Re: Large number of small files

Posted by Marko Dinic <ma...@nissatech.com>.
Anand,

Thank you very much for the clarification. Can you please explain how
I would be able to add new files to the Parquet file? The files used
today won't be the same as the files that were used yesterday, since
new files have arrived in the meantime.

Thanks,
Marko

On Fri 24 Apr 2015 11:33:03 AM CEST, Chandra Mohan, Ananda Vel Murugan 
wrote:
> Marko,
>
> The Parquet file would be created once, when you load the data. You don't
> have to store your small files in HDFS just for the reason of
> subsetting the data by time range. You can store data and metadata in
> the same Parquet file. As already pointed out, Parquet files work well
> with other tools in the Hadoop ecosystem. Apart from the performance of
> your MapReduce jobs, the other aspect is storage efficiency.
> Serialization formats like Avro and Parquet provide better compression,
> and hence the data occupies less space.
>
> Regards,
>
> Anand
>
> *From:*Alexander Alten-Lorenz [mailto:wget.null@gmail.com]
> *Sent:* Friday, April 24, 2015 2:49 PM
> *To:* user@hadoop.apache.org
> *Subject:* Re: Large number of small files
>
> Marko,
>
> Cassandra is a NoSQL DB, much as HBase is for Hadoop. The pros and cons
> won't be discussed here.
>
> Parquet is a columnar storage format. It is - at a high level - a
> bit like a NoSQL DB, but at the storage level. It allows users to
> "query" the data with MR, Pig or similar tools. Additionally, Parquet
> works perfectly with Hive and Cloudera Impala, as well as Apache Drill.
>
> https://parquet.incubator.apache.org/documentation/latest/
>
> http://www.cloudera.com/content/cloudera/en/documentation/cloudera-impala/v2-0-x/topics/impala_parquet.html
>
> https://zoomdata.zendesk.com/hc/en-us/articles/200865073-Loading-My-CSV-Data-into-Impala-as-a-Parquet-Table
>
>
> --
>
> Alexander Alten-Lorenz
> m: wget.null@gmail.com <ma...@gmail.com>
> b: mapredit.blogspot.com <http://mapredit.blogspot.com>
>
>     On Apr 24, 2015, at 11:10 AM, Marko Dinic
>     <marko.dinic@nissatech.com <ma...@nissatech.com>> wrote:
>
>     Anand,
>
>     Thank you for your answer, but wouldn't that mean that I would
>     have to serialize the files each time I need to run the job? And I
>     would still need to save the original files, so the NameNode still
>     needs to take care of them?
>
>     Please correct me if I'm missing something, I'm not very
>     experienced with Hadoop.
>
>     What do you think about using Cassandra?
>
>     Thanks
>
>     On Fri 24 Apr 2015 11:03:19 AM CEST, Chandra Mohan, Ananda Vel
>     Murugan wrote:
>
>     Apart from databases like Cassandra, you may check serialization
>     formats like Avro or Parquet
>
>     Regards,
>     Anand
>
>     -----Original Message-----
>     From: Marko Dinic [mailto:marko.dinic@nissatech.com]
>     Sent: Friday, April 24, 2015 2:23 PM
>     To: user@hadoop.apache.org <ma...@hadoop.apache.org>
>     Subject: Large number of small files
>
>     Hello,
>
>     I'm not sure if this is the place to ask this question, but I'm
>     still hoping for an answer/advice.
>
>     A large number of small files are uploaded, each about 8 KB. I am
>     aware that this is not something you hope for when working with
>     Hadoop.
>
>     I was thinking about using HAR files and combined input, or
>     sequence files. The problem is that the files are timestamped and I
>     need a different subset at different times - for example, one job
>     needs to run on files uploaded during the last 3 months, while the
>     next job might consider the last 6 months. Naturally, as time
>     passes, a different subset of files is needed.
>
>     This means that I would need to make a sequence file (or a HAR)
>     each time I run a job, to get a smaller number of mappers. On the
>     other hand, I need to keep the original files so I can subset them.
>     This means that the NameNode is under constant pressure, keeping
>     all of their metadata in its memory.
>
>     How can I solve this problem?
>
>     I was also considering using Cassandra, or something like that,
>     and saving the file content inside of it, instead of saving it as
>     files on HDFS. The file content is actually a measurement, that is,
>     a vector of numbers, with some metadata.
>
>     Thanks
>

RE: Large number of small files

Posted by "Chandra Mohan, Ananda Vel Murugan" <An...@honeywell.com>.
Marko,

The Parquet file would be created once, when you load the data. You don't have to store your small files in HDFS just for the reason of subsetting the data by time range. You can store data and metadata in the same Parquet file. As already pointed out, Parquet files work well with other tools in the Hadoop ecosystem. Apart from the performance of your MapReduce jobs, the other aspect is storage efficiency. Serialization formats like Avro and Parquet provide better compression, and hence the data occupies less space.
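
As an illustration only, a minimal sketch of loading the measurements into Parquet through the parquet-avro bindings. The schema fields (deviceId, timestamp, values) and the output path are assumptions, and the exact package and builder names vary between Parquet versions:

// Hypothetical sketch: one Parquet row per measurement, holding metadata
// plus the vector of numbers, written with parquet-avro.
import java.util.Arrays;
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.avro.AvroParquetWriter;
import org.apache.parquet.hadoop.ParquetWriter;

public class MeasurementsToParquet {
  // Invented schema: some metadata plus the vector of numbers per measurement.
  private static final Schema SCHEMA = new Schema.Parser().parse(
      "{\"type\":\"record\",\"name\":\"Measurement\",\"fields\":["
      + "{\"name\":\"deviceId\",\"type\":\"string\"},"
      + "{\"name\":\"timestamp\",\"type\":\"long\"},"
      + "{\"name\":\"values\",\"type\":{\"type\":\"array\",\"items\":\"double\"}}"
      + "]}");

  public static void main(String[] args) throws Exception {
    try (ParquetWriter<GenericRecord> writer =
             AvroParquetWriter.<GenericRecord>builder(new Path("/data/measurements.parquet"))
                 .withSchema(SCHEMA)
                 .build()) {
      GenericRecord rec = new GenericData.Record(SCHEMA);
      rec.put("deviceId", "sensor-42");                  // metadata
      rec.put("timestamp", System.currentTimeMillis());  // the file's timestamp
      rec.put("values", Arrays.asList(1.0, 2.5, 3.7));   // the ~8 KB vector
      writer.write(rec);  // in practice: loop over all incoming small files
    }
  }
}

Because the timestamp is an ordinary column, a later job can select the 3-month or 6-month window by filtering on it (or by writing one Parquet file per day or month and selecting files by name), instead of keeping millions of tiny HDFS files around.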

Regards,
Anand

From: Alexander Alten-Lorenz [mailto:wget.null@gmail.com]
Sent: Friday, April 24, 2015 2:49 PM
To: user@hadoop.apache.org
Subject: Re: Large number of small files

Marko,

Cassandra is a NoSQL DB, much as HBase is for Hadoop. The pros and cons won't be discussed here.

Parquet is a columnar storage format. It is - at a high level - a bit like a NoSQL DB, but at the storage level. It allows users to "query" the data with MR, Pig or similar tools. Additionally, Parquet works perfectly with Hive and Cloudera Impala, as well as Apache Drill.

https://parquet.incubator.apache.org/documentation/latest/
http://www.cloudera.com/content/cloudera/en/documentation/cloudera-impala/v2-0-x/topics/impala_parquet.html
https://zoomdata.zendesk.com/hc/en-us/articles/200865073-Loading-My-CSV-Data-into-Impala-as-a-Parquet-Table


--
Alexander Alten-Lorenz
m: wget.null@gmail.com<ma...@gmail.com>
b: mapredit.blogspot.com<http://mapredit.blogspot.com>

On Apr 24, 2015, at 11:10 AM, Marko Dinic <ma...@nissatech.com>> wrote:

Anand,

Thank you for your answer, but wouldn't that mean that I would have to serialize the files each time I need to run the job? And I would still need to save the original files, so the NameNode still needs to take care of them?

Please correct me if I'm missing something, I'm not very experienced with Hadoop.

What do you think about using Cassandra?

Thanks

On Fri 24 Apr 2015 11:03:19 AM CEST, Chandra Mohan, Ananda Vel Murugan wrote:

Apart from databases like Cassandra, you may check serialization formats like Avro or Parquet

Regards,
Anand

-----Original Message-----
From: Marko Dinic [mailto:marko.dinic@nissatech.com]
Sent: Friday, April 24, 2015 2:23 PM
To: user@hadoop.apache.org<ma...@hadoop.apache.org>
Subject: Large number of small files

Hello,

I'm not sure if this is the place to ask this question, but I'm still hoping for an answer/advice.

A large number of small files are uploaded, each about 8 KB. I am aware that this is not something you hope for when working with Hadoop.

I was thinking about using HAR files and combined input, or sequence files. The problem is that the files are timestamped and I need a different subset at different times - for example, one job needs to run on files uploaded during the last 3 months, while the next job might consider the last 6 months. Naturally, as time passes, a different subset of files is needed.

This means that I would need to make a sequence file (or a HAR) each time I run a job, to get a smaller number of mappers. On the other hand, I need to keep the original files so I can subset them. This means that the NameNode is under constant pressure, keeping all of their metadata in its memory.

How can I solve this problem?

I was also considering using Cassandra, or something like that, and saving the file content inside of it, instead of saving it as files on HDFS. The file content is actually a measurement, that is, a vector of numbers, with some metadata.

Thanks


Re: Large number of small files

Posted by Alexander Alten-Lorenz <wg...@gmail.com>.
Marko,

Cassandra is a NoSQL DB, much as HBase is for Hadoop. The pros and cons won't be discussed here.

Parquet is a columnar storage format. It is - at a high level - a bit like a NoSQL DB, but at the storage level. It allows users to "query" the data with MR, Pig or similar tools. Additionally, Parquet works perfectly with Hive and Cloudera Impala, as well as Apache Drill.

https://parquet.incubator.apache.org/documentation/latest/
http://www.cloudera.com/content/cloudera/en/documentation/cloudera-impala/v2-0-x/topics/impala_parquet.html
https://zoomdata.zendesk.com/hc/en-us/articles/200865073-Loading-My-CSV-Data-into-Impala-as-a-Parquet-Table
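
To make the "query on the storage level" point concrete, here is a small sketch that reuses the hypothetical Measurement schema and file path from the write example earlier in this thread and keeps only rows from the last 3 months. Hive, Impala or Drill would express the same filter as SQL:

// Sketch under the assumptions above: plain ParquetReader scan with a
// timestamp filter, standing in for a SQL-style query over the data.
import org.apache.avro.generic.GenericRecord;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.avro.AvroParquetReader;
import org.apache.parquet.hadoop.ParquetReader;

public class ReadRecentMeasurements {
  public static void main(String[] args) throws Exception {
    long cutoff = System.currentTimeMillis() - 90L * 24 * 60 * 60 * 1000; // ~3 months
    try (ParquetReader<GenericRecord> reader =
             AvroParquetReader.<GenericRecord>builder(new Path("/data/measurements.parquet"))
                 .build()) {
      GenericRecord rec;
      while ((rec = reader.read()) != null) {
        if ((Long) rec.get("timestamp") >= cutoff) {
          System.out.println(rec);  // hand the row to the real processing here
        }
      }
    }
  }
}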


--
Alexander Alten-Lorenz
m: wget.null@gmail.com
b: mapredit.blogspot.com

> On Apr 24, 2015, at 11:10 AM, Marko Dinic <ma...@nissatech.com> wrote:
> 
> Anand,
> 
> Thank you for your answer, but wouldn't that mean that I would have to serialize the files each time I need to run the job? And I would still need to save the original files, so the NameNode still needs to take care of them?
> 
> Please correct me if I'm missing something, I'm not very experienced with Hadoop.
> 
> What do you think about using Cassandra?
> 
> Thanks
> 
> On Fri 24 Apr 2015 11:03:19 AM CEST, Chandra Mohan, Ananda Vel Murugan wrote:
>> Apart from databases like Cassandra, you may check serialization formats like Avro or Parquet
>> 
>> Regards,
>> Anand
>> 
>> -----Original Message-----
>> From: Marko Dinic [mailto:marko.dinic@nissatech.com]
>> Sent: Friday, April 24, 2015 2:23 PM
>> To: user@hadoop.apache.org
>> Subject: Large number of small files
>> 
>> Hello,
>> 
>> I'm not sure if this is the place to ask this question, but I'm still hoping for an answer/advice.
>>
>> A large number of small files are uploaded, each about 8 KB. I am aware that this is not something you hope for when working with Hadoop.
>>
>> I was thinking about using HAR files and combined input, or sequence files. The problem is that the files are timestamped and I need a different subset at different times - for example, one job needs to run on files uploaded during the last 3 months, while the next job might consider the last 6 months. Naturally, as time passes, a different subset of files is needed.
>>
>> This means that I would need to make a sequence file (or a HAR) each time I run a job, to get a smaller number of mappers. On the other hand, I need to keep the original files so I can subset them. This means that the NameNode is under constant pressure, keeping all of their metadata in its memory.
>>
>> How can I solve this problem?
>>
>> I was also considering using Cassandra, or something like that, and saving the file content inside of it, instead of saving it as files on HDFS. The file content is actually a measurement, that is, a vector of numbers, with some metadata.
>> 
>> Thanks


Re: Large number of small files

Posted by Alexander Alten-Lorenz <wg...@gmail.com>.
Marko,

Cassandra is an noSQL DB like HBase for Hadoop is. Pro and cons wouldn't be discussed here.

Parquet is an columnar based storage format. It is - high level - a bit like a NoSQL DB, but on the storage level. it allows users to "query" the data with MR, Pig or similar tools. Additionally, Parquet works perfectly with Hive and Cloudera Impala as well as Apache Dremel.

https://parquet.incubator.apache.org/documentation/latest/ <https://parquet.incubator.apache.org/documentation/latest/>
http://www.cloudera.com/content/cloudera/en/documentation/cloudera-impala/v2-0-x/topics/impala_parquet.html <http://www.cloudera.com/content/cloudera/en/documentation/cloudera-impala/v2-0-x/topics/impala_parquet.html>
https://zoomdata.zendesk.com/hc/en-us/articles/200865073-Loading-My-CSV-Data-into-Impala-as-a-Parquet-Table <https://zoomdata.zendesk.com/hc/en-us/articles/200865073-Loading-My-CSV-Data-into-Impala-as-a-Parquet-Table>


--
Alexander Alten-Lorenz
m: wget.null@gmail.com
b: mapredit.blogspot.com

> On Apr 24, 2015, at 11:10 AM, Marko Dinic <ma...@nissatech.com> wrote:
> 
> Anand,
> 
> Thank you for your answer, but wouldn't that mean that I would have to serialize the files each time I need to run the job? And I would still need to save the original files, so the NameNode still needs to take care of them?
> 
> Please correct me if I'm missing something, I'm not very experienced with Hadoop.
> 
> What do you think about using Cassandra?
> 
> Thanks
> 
> On Fri 24 Apr 2015 11:03:19 AM CEST, Chandra Mohan, Ananda Vel Murugan wrote:
>> Apart from databases like Cassandra, you may check serialization formats like Avro or Parquet
>> 
>> Regards,
>> Anand
>> 
>> -----Original Message-----
>> From: Marko Dinic [mailto:marko.dinic@nissatech.com]
>> Sent: Friday, April 24, 2015 2:23 PM
>> To: user@hadoop.apache.org
>> Subject: Large number of small files
>> 
>> Hello,
>> 
>> I'm not sure if this is the place to ask this question, but I'm still hopping for an answer/advice.
>> 
>> Large number of small files are uploaded, about 8KB. I am aware that this is not something that you're hopping for when working with Hadoop.
>> 
>> I was thinking about using HAR files and combined input, or sequence files. The problem is, files are timestamped, and I need different subset in different time, for example - one job needs to run on files that are uploaded during last 3 months, while next job might consider last 6 months. Naturally, as time passes different subset of files is needed.
>> 
>> This means that I would need to make a sequence file (or a HAR) each time I run a job, to have smaller number of mappers. On the other hand, I need the original files so I could subset them. This means that DataNode is at constant pressure, saving all of this in its memory.
>> 
>> How can I solve this problem?
>> 
>> I was also considering using Cassandra, or something like that, and to save the file content inside of it, instead of saving it to files on HDFS. FIle content is actually some measurement, that is, a vector of numbers, with some metadata.
>> 
>> Thanks


Re: Large number of small files

Posted by Marko Dinic <ma...@nissatech.com>.
Anand,

Thank you for your answer, but wouldn't that mean that I would have to 
serialize the files each time I need to run the job? And I would still 
need to save the original files, so the NameNode still needs to take 
care of them?

Please correct me if I'm missing something, I'm not very experienced 
with Hadoop.

What do you think about using Cassandra?

Thanks

On Fri 24 Apr 2015 11:03:19 AM CEST, Chandra Mohan, Ananda Vel Murugan 
wrote:
> Apart from databases like Cassandra, you may check serialization formats like Avro or Parquet
>
> Regards,
> Anand
>
> -----Original Message-----
> From: Marko Dinic [mailto:marko.dinic@nissatech.com]
> Sent: Friday, April 24, 2015 2:23 PM
> To: user@hadoop.apache.org
> Subject: Large number of small files
>
> Hello,
>
> I'm not sure if this is the place to ask this question, but I'm still hoping for an answer/advice.
>
> A large number of small files are uploaded, each about 8KB. I am aware that this is not something that you're hoping for when working with Hadoop.
>
> I was thinking about using HAR files and combined input, or sequence files. The problem is, the files are timestamped, and I need a different subset at different times, for example - one job needs to run on files that were uploaded during the last 3 months, while the next job might consider the last 6 months. Naturally, as time passes a different subset of files is needed.
>
> This means that I would need to make a sequence file (or a HAR) each time I run a job, to have a smaller number of mappers. On the other hand, I need the original files so I could subset them. This means that the NameNode is under constant pressure, keeping all of this metadata in its memory.
>
> How can I solve this problem?
>
> I was also considering using Cassandra, or something like that, and saving the file content inside of it, instead of saving it to files on HDFS. File content is actually some measurement, that is, a vector of numbers, with some metadata.
>
> Thanks

RE: Large number of small files

Posted by "Chandra Mohan, Ananda Vel Murugan" <An...@honeywell.com>.
Apart from databases like Cassandra, you may check serialization formats like Avro or Parquet
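
For what it's worth, here is a rough sketch of the Avro option, rolling many small measurements into a single container file per day; the schema and file name are made up for illustration.

import java.io.File;
import java.util.Arrays;

import org.apache.avro.Schema;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;

public class PackMeasurementsIntoAvro {

    public static void main(String[] args) throws Exception {
        // Hypothetical schema for one 8KB measurement file: timestamp plus vector.
        Schema schema = new Schema.Parser().parse(
            "{\"type\":\"record\",\"name\":\"Measurement\",\"fields\":["
            + "{\"name\":\"ts\",\"type\":\"long\"},"
            + "{\"name\":\"values\",\"type\":{\"type\":\"array\",\"items\":\"double\"}}]}");

        // One container file per day keeps the file count low while still
        // letting jobs pick a time range by selecting whole files.
        try (DataFileWriter<GenericRecord> writer =
                 new DataFileWriter<>(new GenericDatumWriter<GenericRecord>(schema))) {
            writer.create(schema, new File("measurements-2015-04-24.avro"));

            GenericRecord rec = new GenericData.Record(schema);
            rec.put("ts", System.currentTimeMillis());
            rec.put("values", Arrays.asList(0.1, 0.2, 0.3));
            writer.append(rec); // one record per original small file
        }
    }
}

Each original 8KB file becomes one appended record, so the NameNode only tracks one container file per day instead of thousands of tiny ones.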

Regards,
Anand

-----Original Message-----
From: Marko Dinic [mailto:marko.dinic@nissatech.com] 
Sent: Friday, April 24, 2015 2:23 PM
To: user@hadoop.apache.org
Subject: Large number of small files

Hello,

I'm not sure if this is the place to ask this question, but I'm still hoping for an answer/advice.

A large number of small files are uploaded, each about 8KB. I am aware that this is not something that you're hoping for when working with Hadoop.

I was thinking about using HAR files and combined input, or sequence files. The problem is, the files are timestamped, and I need a different subset at different times, for example - one job needs to run on files that were uploaded during the last 3 months, while the next job might consider the last 6 months. Naturally, as time passes a different subset of files is needed.

This means that I would need to make a sequence file (or a HAR) each time I run a job, to have a smaller number of mappers. On the other hand, I need the original files so I could subset them. This means that the NameNode is under constant pressure, keeping all of this metadata in its memory.

How can I solve this problem?

I was also considering using Cassandra, or something like that, and saving the file content inside of it, instead of saving it to files on HDFS. File content is actually some measurement, that is, a vector of numbers, with some metadata.

Thanks

Re: Large number of small files

Posted by Takenori Sato <ts...@cloudian.com>.
Hi Marko,

I think there are two major problems you should care about.

1. name node memory
2. job overhead

To avoid 1, I suggest storing your data in an external storage system like
HBase, S3, or Cassandra, rather than HDFS.
For details, please refer to the following.
https://www.usenix.org/legacy/publications/login/2010-04/openpdfs/shvachko.pdf
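
As a rough sketch of point 1, assuming an HBase table named "measurements" with a column family "d" already exists (the table name, row-key format, and column names are all invented here):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class MeasurementStoreSketch {

    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("measurements"))) {

            // Row key = zero-padded timestamp, so rows sort by time and a
            // 3- or 6-month window becomes a plain start/stop row scan.
            long ts = System.currentTimeMillis();
            Put put = new Put(Bytes.toBytes(String.format("%013d", ts)));
            put.addColumn(Bytes.toBytes("d"), Bytes.toBytes("vector"),
                    Bytes.toBytes("1.0,2.5,3.7")); // or a properly serialized array
            table.put(put);

            // Read back only the rows inside the desired time range.
            long from = ts - 90L * 24 * 60 * 60 * 1000; // roughly 3 months back
            Scan scan = new Scan(Bytes.toBytes(String.format("%013d", from)),
                    Bytes.toBytes(String.format("%013d", ts + 1)));
            try (ResultScanner scanner = table.getScanner(scan)) {
                for (Result r : scanner) {
                    byte[] vector = r.getValue(Bytes.toBytes("d"), Bytes.toBytes("vector"));
                    // hand the vector to the analysis job here
                }
            }
        }
    }
}

Keying rows by a zero-padded timestamp keeps them sorted by time, so a 3- or 6-month window is a single range scan and no per-file metadata ends up in the NameNode.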

For 2, you may want to use a higher-level language like Pig,
which will automatically combine a bunch of your small inputs into a larger
one.
http://hadoop.apache.org/docs/r2.6.0/api/org/apache/hadoop/mapred/lib/CombineFileInputFormat.html
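
For point 2, here is a minimal MapReduce driver sketch using CombineTextInputFormat; the paths and split size are placeholders, and the stock identity Mapper stands in for a real one.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.CombineTextInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class CombineSmallFilesJob {

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "combine small measurement files");
        job.setJarByClass(CombineSmallFilesJob.class);

        // Pack many ~8KB files into each split so a mapper processes
        // roughly 128 MB of input instead of a single tiny file.
        job.setInputFormatClass(CombineTextInputFormat.class);
        CombineTextInputFormat.setMaxInputSplitSize(job, 128L * 1024 * 1024);

        job.setMapperClass(Mapper.class); // identity mapper as a placeholder
        job.setNumReduceTasks(0);
        job.setOutputKeyClass(LongWritable.class);
        job.setOutputValueClass(Text.class);
        job.setOutputFormatClass(TextOutputFormat.class);

        // A real driver would add one input path per day or month directory
        // inside the desired time range.
        FileInputFormat.addInputPath(job, new Path("hdfs:///data/measurements/2015/04"));
        FileOutputFormat.setOutputPath(job, new Path("hdfs:///tmp/combined-out"));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Each split then covers roughly 128 MB worth of small files, so the job launches a handful of mappers instead of one per 8KB file.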

Thanks,
Takenori

On Fri, Apr 24, 2015 at 5:53 PM, Marko Dinic <ma...@nissatech.com>
wrote:

> Hello,
>
> I'm not sure if this is the place to ask this question, but I'm still
> hoping for an answer/advice.
>
> A large number of small files are uploaded, each about 8KB. I am aware that
> this is not something that you're hoping for when working with Hadoop.
>
> I was thinking about using HAR files and combined input, or sequence
> files. The problem is, the files are timestamped, and I need a different
> subset at different times, for example - one job needs to run on files that
> were uploaded during the last 3 months, while the next job might consider
> the last 6 months. Naturally, as time passes a different subset of files is
> needed.
>
> This means that I would need to make a sequence file (or a HAR) each time
> I run a job, to have a smaller number of mappers. On the other hand, I need
> the original files so I could subset them. This means that the NameNode is
> under constant pressure, keeping all of this metadata in its memory.
>
> How can I solve this problem?
>
> I was also considering using Cassandra, or something like that, and saving
> the file content inside of it, instead of saving it to files on HDFS.
> File content is actually some measurement, that is, a vector of numbers,
> with some metadata.
>
> Thanks
>
