Posted to dev@phoenix.apache.org by "Fustes, Diego" <Di...@ndt-global.com> on 2016/01/07 11:55:13 UTC

Bulk load for binary file formats

Hi all,

In our project we need to ingest large amounts of data (1TB stored in custom binary files) into HBase using Phoenix. At the moment we do this by converting the binary files to CSV and then running the bulk load tool included in Phoenix. Unfortunately, this process takes too long: we first have to store very large files in HDFS (10TB as CSV) and then run the MapReduce job that converts these files to HFiles.
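
For reference, we currently drive the loader along these lines; the
table name, input path and ZooKeeper quorum below are placeholders,
not our real values:

    import org.apache.hadoop.util.ToolRunner;
    import org.apache.phoenix.mapreduce.CsvBulkLoadTool;

    public class CsvLoadDriver {
        public static void main(String[] args) throws Exception {
            // Equivalent to: hadoop jar phoenix-<version>-client.jar \
            //     org.apache.phoenix.mapreduce.CsvBulkLoadTool ...
            int rc = ToolRunner.run(new CsvBulkLoadTool(), new String[] {
                "--table", "SENSOR_DATA",         // placeholder table name
                "--input", "/data/converted-csv", // placeholder HDFS dir with the CSVs
                "--zookeeper", "zk1,zk2,zk3:2181" // placeholder ZK quorum
            });
            System.exit(rc);
        }
    }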

I think it would be considerably faster and more compact to use another file format (for example, Avro) as intermediate storage for bulk loading. Could this be implemented in an upcoming release of Phoenix?
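
On the input side this looks simple enough; here is an untested sketch
of how such a job could consume Avro container files instead of CSV
text (the mapper itself is omitted, and the class and method names
here are ours, not Phoenix's):

    import org.apache.avro.Schema;
    import org.apache.avro.mapreduce.AvroJob;
    import org.apache.avro.mapreduce.AvroKeyInputFormat;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

    public class AvroInputSetup {
        // Configure a bulk load job to read Avro records rather than CSV lines.
        public static Job configure(Configuration conf, Schema recordSchema,
                Path input) throws Exception {
            Job job = Job.getInstance(conf, "avro-bulk-load");
            job.setInputFormatClass(AvroKeyInputFormat.class);
            AvroJob.setInputKeySchema(job, recordSchema);
            FileInputFormat.addInputPath(job, input);
            // A mapper would then receive AvroKey<GenericRecord> keys and
            // emit the same KeyValues the CSV mapper produces today.
            return job;
        }
    }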

Another possibility would be for us to create the HFiles directly in our own code. How complex would that be?

With kind regards,

Diego



NDT GDAC Spain S.L.
Diego Fustes, Big Data and Machine Learning Expert
Gran Vía de les Corts Catalanes 130, 11th floor
08038 Barcelona, Spain
Phone: +34 93 43 255 27
diego.fustes@ndt-global.com
www.ndt-global.com








Re: Bulk load for binary file formats

Posted by Nick Dimiduk <nd...@apache.org>.
Hi Diego,

I recommend the latter -- creating HFiles directly from your application.
That is, unless you have a specific need for the intermediate format.
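
With the plain HBase client API that usually takes the shape below.
This is only a rough sketch against the 1.x API: the table name is
made up, and you would plug in your own InputFormat and mapper. For a
Phoenix table, the tricky part is emitting row keys and values in
exactly the encoding Phoenix expects (salt bytes, composite key
layout, type serialization), so test round-trips carefully.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.KeyValue;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.*;
    import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
    import org.apache.hadoop.hbase.mapreduce.HFileOutputFormat2;
    import org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class DirectHFileLoad {
        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();
            TableName name = TableName.valueOf("MY_TABLE"); // made-up table
            Path input = new Path(args[0]);   // custom binary files
            Path hfiles = new Path(args[1]);  // staging dir for the HFiles

            Job job = Job.getInstance(conf, "binary-to-hfiles");
            job.setJarByClass(DirectHFileLoad.class);
            // job.setInputFormatClass(...); // your binary InputFormat
            // job.setMapperClass(...);      // emits the pairs declared below
            job.setMapOutputKeyClass(ImmutableBytesWritable.class);
            job.setMapOutputValueClass(KeyValue.class);
            FileInputFormat.addInputPath(job, input);
            FileOutputFormat.setOutputPath(job, hfiles);

            try (Connection conn = ConnectionFactory.createConnection(conf);
                 Table table = conn.getTable(name);
                 RegionLocator locator = conn.getRegionLocator(name);
                 Admin admin = conn.getAdmin()) {
                // Wires up the output format, total-order partitioner and
                // sorting reducer to match the table's region boundaries.
                HFileOutputFormat2.configureIncrementalLoad(job, table, locator);
                if (job.waitForCompletion(true)) {
                    // Hand the finished HFiles to the region servers.
                    new LoadIncrementalHFiles(conf)
                            .doBulkLoad(hfiles, admin, table, locator);
                }
            }
        }
    }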

I recently did some work in this area, abstracting the bulkload tooling
somewhat to add support for loading from JSON files, and I'd support
continuing that abstraction/refactoring effort. Have a look at the code
in and around o.a.p.mapreduce.AbstractBulkLoadTool; you can probably
implement your custom format reader on top of that harness. If not, I'm
happy to review/commit any changes necessary to support other extensions.
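
Very roughly, a new tool has this shape; the hook names below follow
the CSV and JSON tools at the time of writing, so double-check them
against the version you build against:

    import java.sql.SQLException;
    import java.util.List;
    import org.apache.commons.cli.CommandLine;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.phoenix.mapreduce.AbstractBulkLoadTool;
    import org.apache.phoenix.util.ColumnInfo;

    public class MyFormatBulkLoadTool extends AbstractBulkLoadTool {

        @Override
        protected void configureOptions(CommandLine cmdLine,
                List<ColumnInfo> importColumns, Configuration conf)
                throws SQLException {
            // Stash format-specific settings (schema location, etc.) in conf
            // so your mapper can read them back.
        }

        @Override
        protected void setupJob(Job job) {
            // Point the job at your InputFormat and a mapper that turns your
            // records into KeyValues; CsvToKeyValueMapper and
            // JsonToKeyValueMapper show the pattern to follow.
        }
    }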

Unfortunately, the only interface we currently keep compatible across
versions is the SQL interface, which means we may change these classes
from release to release. Perhaps not terribly often, but keep this in
mind as you press forward with your efforts.

Let us know if you have further questions,
-n
