Posted to mapreduce-user@hadoop.apache.org by Roger Maillist <da...@gmail.com> on 2014/10/02 10:12:11 UTC

HDFS - many files, small size

Hi there
I have millions of rather small PDF files which I want to load into HDFS for
later analysis. I also need to re-encode them as a base64 stream to make the
MR job for parsing work.

Is there any better/faster method than just calling 'put' in a huge (bash)
loop? Maybe I could implement the encoding and loading as an MR job itself?
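
For illustration, a minimal sketch (assuming Java 8 and the made-up paths
/local/pdfs and /data/pdfs-b64) of doing the base64 encoding and the upload
from a single JVM via the Hadoop FileSystem API, instead of starting one
'hdfs dfs -put' process per file:

import java.io.File;
import java.nio.file.Files;
import java.util.Base64;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class Base64Upload {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();  // picks up core-site.xml / hdfs-site.xml
    FileSystem fs = FileSystem.get(conf);      // the configured default file system (HDFS)
    for (File pdf : new File("/local/pdfs").listFiles()) {  // made-up local source dir
      byte[] encoded = Base64.getEncoder().encode(Files.readAllBytes(pdf.toPath()));
      FSDataOutputStream out = fs.create(new Path("/data/pdfs-b64/" + pdf.getName() + ".b64"));
      out.write(encoded);                      // one base64-encoded copy per input PDF
      out.close();
    }
    fs.close();
  }
}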

Second thing: according to a Cloudera blog post I read, it's a bad idea to
store small files on HDFS, especially in large numbers. They recommend HBase
instead. However, I want to do further processing via HCatalog...
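
For illustration, a minimal sketch of what storing one such PDF in HBase
could look like with the client API of that time; the table name 'pdfs',
the column family 'f' and the file path are made-up assumptions:

import java.io.File;
import java.nio.file.Files;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class StorePdfInHBase {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();   // reads hbase-site.xml
    HTable table = new HTable(conf, "pdfs");             // made-up table name
    File pdf = new File("/local/pdfs/example.pdf");      // made-up input file
    Put put = new Put(Bytes.toBytes(pdf.getName()));      // row key = file name
    put.add(Bytes.toBytes("f"),                           // made-up column family
            Bytes.toBytes("content"),                     // qualifier holding the raw bytes
            Files.readAllBytes(pdf.toPath()));
    table.put(put);
    table.close();
  }
}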

Thanks for your suggestions
Roger

Re: HDFS - many files, small size

Posted by Mirko Kämpf <mi...@gmail.com>.
Hi Roger,

you can use Apache Flume to ingest these files into your cluster. Store them
in an HBase table for fast random access and extract the "metadata" on the
fly using morphlines (see:
http://kitesdk.org/docs/0.11.0/kite-morphlines/index.html). Even the base64
conversion can be done on the fly if you like. For MapReduce jobs you can
consider SequenceFiles as intermediate storage, or Avro, as it is more
flexible. HCatalog allows you to access datasets stored in HBase (see:
https://cwiki.apache.org/confluence/display/HCATALOG/HCatalog+HBase+Integration+Design).
If random access to all the files is not required, I suggest not using HBase.
SolrCloud can also store the raw content alongside the extracted metadata,
but MR is not that simple in that case.
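
For illustration, a minimal sketch (not Mirko's pipeline; the paths are
made-up assumptions) of packing the small PDFs into a single SequenceFile on
HDFS, keyed by file name with the base64-encoded content as value, so a later
MR job reads one large file instead of millions of small ones:

import java.io.File;
import java.nio.file.Files;
import java.util.Base64;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class PackPdfsIntoSequenceFile {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    SequenceFile.Writer writer = SequenceFile.createWriter(conf,
        SequenceFile.Writer.file(new Path("/data/pdfs.seq")),  // one big container file
        SequenceFile.Writer.keyClass(Text.class),              // key = original file name
        SequenceFile.Writer.valueClass(Text.class));           // value = base64 content
    for (File pdf : new File("/local/pdfs").listFiles()) {     // made-up local source dir
      String b64 = Base64.getEncoder().encodeToString(Files.readAllBytes(pdf.toPath()));
      writer.append(new Text(pdf.getName()), new Text(b64));
    }
    writer.close();
  }
}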

Good luck and best wishes,
Mirko

2014-10-02 9:12 GMT+01:00 Roger Maillist <da...@gmail.com>:

> Hi there
> I have millions of rather small PDF files which I want to load into HDFS
> for later analysis. I also need to re-encode them as a base64 stream to
> make the MR job for parsing work.
>
> Is there any better/faster method than just calling 'put' in a huge (bash)
> loop? Maybe I could implement the encoding and loading as an MR job itself?
>
> Second thing: according to a Cloudera blog post I read, it's a bad idea to
> store small files on HDFS, especially in large numbers. They recommend
> HBase instead. However, I want to do further processing via HCatalog...
>
> Thanks for your suggestions
> Roger
>
