You are viewing a plain text version of this content. The canonical link for it is here.

Posted to dev@phoenix.apache.org by Gabriel Reid <gr...@apache.org> on 2014/03/15 22:16:56 UTC

Review Request 19257: PHOENIX-129 - Improve MapReduce import

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/19257/
-----------------------------------------------------------

Review request for phoenix.


Repository: phoenix


Description
-------

Rewrite of the Phoenix MapReduce import, to follow more standard MR tool development patterns and use standard Phoenix CSV handling functionality.


Diffs
-----

  bin/csv-bulk-loader.py 385ef41 
  bin/readme.txt fa23eeb 
  phoenix-assembly/pom.xml 9a69cab 
  phoenix-assembly/src/build/all.xml 9c1bc41 
  phoenix-assembly/src/build/mapreduce.xml PRE-CREATION 
  phoenix-core/src/it/java/org/apache/phoenix/mapreduce/CsvBulkLoadToolIT.java PRE-CREATION 
  phoenix-core/src/main/java/org/apache/phoenix/map/reduce/CSVBulkLoader.java 0fd74e1 
  phoenix-core/src/main/java/org/apache/phoenix/map/reduce/MapReduceJob.java 9dc8032 
  phoenix-core/src/main/java/org/apache/phoenix/map/reduce/util/ConfigReader.java 1c94d6f 
  phoenix-core/src/main/java/org/apache/phoenix/mapreduce/CsvBulkImportUtil.java PRE-CREATION 
  phoenix-core/src/main/java/org/apache/phoenix/mapreduce/CsvBulkLoadTool.java PRE-CREATION 
  phoenix-core/src/main/java/org/apache/phoenix/mapreduce/CsvToKeyValueMapper.java PRE-CREATION 
  phoenix-core/src/main/java/org/apache/phoenix/mapreduce/ImportPreUpsertKeyValueProcessor.java PRE-CREATION 
  phoenix-core/src/main/java/org/apache/phoenix/schema/PDataType.java 34166d3 
  phoenix-core/src/main/java/org/apache/phoenix/util/CSVCommonsLoader.java 3e86477 
  phoenix-core/src/main/java/org/apache/phoenix/util/ColumnInfo.java 37238c8 
  phoenix-core/src/test/java/org/apache/phoenix/mapreduce/CsvBulkImportUtilTest.java PRE-CREATION 
  phoenix-core/src/test/java/org/apache/phoenix/mapreduce/CsvBulkLoadToolTest.java PRE-CREATION 
  phoenix-core/src/test/java/org/apache/phoenix/mapreduce/CsvToKeyValueMapperTest.java PRE-CREATION 
  phoenix-core/src/test/java/org/apache/phoenix/util/ColumnInfoTest.java PRE-CREATION 

Diff: https://reviews.apache.org/r/19257/diff/


Testing
-------


Thanks,

Gabriel Reid

Re: Review Request 19257: PHOENIX-129 - Improve MapReduce import

Posted by Gabriel Reid <gr...@apache.org>.


> On March 17, 2014, 8:13 a.m., Gabriel Reid wrote:
> >

This review was created in error while responding to Prashant's comments -- it can be ignored.


- Gabriel


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/19257/#review37348
-----------------------------------------------------------


On March 15, 2014, 9:16 p.m., Gabriel Reid wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/19257/
> -----------------------------------------------------------
> 
> (Updated March 15, 2014, 9:16 p.m.)
> 
> 
> Review request for phoenix.
> 
> 
> Repository: phoenix
> 
> 
> Description
> -------
> 
> Rewrite of the Phoenix MapReduce import, to follow more standard MR tool development patterns and use standard Phoenix CSV handling functionality.
> 
> 
> Diffs
> -----
> 
>   bin/csv-bulk-loader.py 385ef41 
>   bin/readme.txt fa23eeb 
>   phoenix-assembly/pom.xml 9a69cab 
>   phoenix-assembly/src/build/all.xml 9c1bc41 
>   phoenix-assembly/src/build/mapreduce.xml PRE-CREATION 
>   phoenix-core/src/it/java/org/apache/phoenix/mapreduce/CsvBulkLoadToolIT.java PRE-CREATION 
>   phoenix-core/src/main/java/org/apache/phoenix/map/reduce/CSVBulkLoader.java 0fd74e1 
>   phoenix-core/src/main/java/org/apache/phoenix/map/reduce/MapReduceJob.java 9dc8032 
>   phoenix-core/src/main/java/org/apache/phoenix/map/reduce/util/ConfigReader.java 1c94d6f 
>   phoenix-core/src/main/java/org/apache/phoenix/mapreduce/CsvBulkImportUtil.java PRE-CREATION 
>   phoenix-core/src/main/java/org/apache/phoenix/mapreduce/CsvBulkLoadTool.java PRE-CREATION 
>   phoenix-core/src/main/java/org/apache/phoenix/mapreduce/CsvToKeyValueMapper.java PRE-CREATION 
>   phoenix-core/src/main/java/org/apache/phoenix/mapreduce/ImportPreUpsertKeyValueProcessor.java PRE-CREATION 
>   phoenix-core/src/main/java/org/apache/phoenix/schema/PDataType.java 34166d3 
>   phoenix-core/src/main/java/org/apache/phoenix/util/CSVCommonsLoader.java 3e86477 
>   phoenix-core/src/main/java/org/apache/phoenix/util/ColumnInfo.java 37238c8 
>   phoenix-core/src/test/java/org/apache/phoenix/mapreduce/CsvBulkImportUtilTest.java PRE-CREATION 
>   phoenix-core/src/test/java/org/apache/phoenix/mapreduce/CsvBulkLoadToolTest.java PRE-CREATION 
>   phoenix-core/src/test/java/org/apache/phoenix/mapreduce/CsvToKeyValueMapperTest.java PRE-CREATION 
>   phoenix-core/src/test/java/org/apache/phoenix/util/ColumnInfoTest.java PRE-CREATION 
> 
> Diff: https://reviews.apache.org/r/19257/diff/
> 
> 
> Testing
> -------
> 
> 
> Thanks,
> 
> Gabriel Reid
> 
>

Re: Review Request 19257: PHOENIX-129 - Improve MapReduce import

Posted by Gabriel Reid <gr...@apache.org>.

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/19257/#review37348
-----------------------------------------------------------



bin/readme.txt
<https://reviews.apache.org/r/19257/#comment68876>

    Yes, it's (currently) always temporary. The HFiles that are written as part of the import process (but thrown away after they've been added to HBase) are written to a random directory under /tmp (in HDFS) by default, but depending on the setup of HDFS and the existence of directory permissions on it, it may be necessary to specify a different temp path to use (e.g. something under the user directory). Ideally this parameter wouldn't need to exist at all, but I think it is needed.
    
    James also mentioned a potential use case of writing HFiles for a table that doesn't exist -- I can see this being useful if you've got a separate MR cluster from your HBase cluster, and want to build up the HFiles quickly without putting a load on your HBase cluster. That's not a use case that's currently supported, but could become useful in the future.



phoenix-core/src/it/java/org/apache/phoenix/mapreduce/CsvBulkLoadToolIT.java
<https://reviews.apache.org/r/19257/#comment68869>

    This is a file being created on HDFS (not local FS), so couldn't platform-agnostic separators cause problems here?
    
    Closing it in a finally sounds like a good plan -- that being said, again this is on a transient HDFS cluster within the test, so there is no garbage that is going to be left behind after the test stops no matter what. 



phoenix-core/src/main/java/org/apache/phoenix/mapreduce/CsvBulkLoadTool.java
<https://reviews.apache.org/r/19257/#comment68870>

    No, it's used by the unit tests, so it needs to be package private. It should be annotated with @VisibleForTesting though.



phoenix-core/src/main/java/org/apache/phoenix/mapreduce/CsvBulkLoadTool.java
<https://reviews.apache.org/r/19257/#comment68871>

    This is a path on HDFS, so java.io.tmpdir will point to something else. The explanation on the approach for the creating of this tempdir is in a comment above.



phoenix-core/src/main/java/org/apache/phoenix/mapreduce/CsvBulkLoadTool.java
<https://reviews.apache.org/r/19257/#comment68872>

    Again, used by unit tests, but should be annotated with @VisibleForTesting.



phoenix-core/src/main/java/org/apache/phoenix/mapreduce/CsvBulkLoadTool.java
<https://reviews.apache.org/r/19257/#comment68873>

    Yep, this one really should be private :-)



phoenix-core/src/main/java/org/apache/phoenix/mapreduce/CsvBulkLoadTool.java
<https://reviews.apache.org/r/19257/#comment68874>

    Yes, nice catch on that.



phoenix-core/src/main/java/org/apache/phoenix/mapreduce/CsvToKeyValueMapper.java
<https://reviews.apache.org/r/19257/#comment68875>

    Yes, good point -- I think I had a specific reason for doing it this way, but I don't remember what the exact reason was and unfortunately I didn't document it here. Should be double-checked.


- Gabriel Reid


On March 15, 2014, 9:16 p.m., Gabriel Reid wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/19257/
> -----------------------------------------------------------
> 
> (Updated March 15, 2014, 9:16 p.m.)
> 
> 
> Review request for phoenix.
> 
> 
> Repository: phoenix
> 
> 
> Description
> -------
> 
> Rewrite of the Phoenix MapReduce import, to follow more standard MR tool development patterns and use standard Phoenix CSV handling functionality.
> 
> 
> Diffs
> -----
> 
>   bin/csv-bulk-loader.py 385ef41 
>   bin/readme.txt fa23eeb 
>   phoenix-assembly/pom.xml 9a69cab 
>   phoenix-assembly/src/build/all.xml 9c1bc41 
>   phoenix-assembly/src/build/mapreduce.xml PRE-CREATION 
>   phoenix-core/src/it/java/org/apache/phoenix/mapreduce/CsvBulkLoadToolIT.java PRE-CREATION 
>   phoenix-core/src/main/java/org/apache/phoenix/map/reduce/CSVBulkLoader.java 0fd74e1 
>   phoenix-core/src/main/java/org/apache/phoenix/map/reduce/MapReduceJob.java 9dc8032 
>   phoenix-core/src/main/java/org/apache/phoenix/map/reduce/util/ConfigReader.java 1c94d6f 
>   phoenix-core/src/main/java/org/apache/phoenix/mapreduce/CsvBulkImportUtil.java PRE-CREATION 
>   phoenix-core/src/main/java/org/apache/phoenix/mapreduce/CsvBulkLoadTool.java PRE-CREATION 
>   phoenix-core/src/main/java/org/apache/phoenix/mapreduce/CsvToKeyValueMapper.java PRE-CREATION 
>   phoenix-core/src/main/java/org/apache/phoenix/mapreduce/ImportPreUpsertKeyValueProcessor.java PRE-CREATION 
>   phoenix-core/src/main/java/org/apache/phoenix/schema/PDataType.java 34166d3 
>   phoenix-core/src/main/java/org/apache/phoenix/util/CSVCommonsLoader.java 3e86477 
>   phoenix-core/src/main/java/org/apache/phoenix/util/ColumnInfo.java 37238c8 
>   phoenix-core/src/test/java/org/apache/phoenix/mapreduce/CsvBulkImportUtilTest.java PRE-CREATION 
>   phoenix-core/src/test/java/org/apache/phoenix/mapreduce/CsvBulkLoadToolTest.java PRE-CREATION 
>   phoenix-core/src/test/java/org/apache/phoenix/mapreduce/CsvToKeyValueMapperTest.java PRE-CREATION 
>   phoenix-core/src/test/java/org/apache/phoenix/util/ColumnInfoTest.java PRE-CREATION 
> 
> Diff: https://reviews.apache.org/r/19257/diff/
> 
> 
> Testing
> -------
> 
> 
> Thanks,
> 
> Gabriel Reid
> 
>

Re: Review Request 19257: PHOENIX-129 - Improve MapReduce import

Posted by Prashant Kommireddi <pr...@gmail.com>.

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/19257/#review37381
-----------------------------------------------------------

Ship it!


Ship It!

- Prashant Kommireddi


On March 15, 2014, 9:16 p.m., Gabriel Reid wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/19257/
> -----------------------------------------------------------
> 
> (Updated March 15, 2014, 9:16 p.m.)
> 
> 
> Review request for phoenix.
> 
> 
> Repository: phoenix
> 
> 
> Description
> -------
> 
> Rewrite of the Phoenix MapReduce import, to follow more standard MR tool development patterns and use standard Phoenix CSV handling functionality.
> 
> 
> Diffs
> -----
> 
>   bin/csv-bulk-loader.py 385ef41 
>   bin/readme.txt fa23eeb 
>   phoenix-assembly/pom.xml 9a69cab 
>   phoenix-assembly/src/build/all.xml 9c1bc41 
>   phoenix-assembly/src/build/mapreduce.xml PRE-CREATION 
>   phoenix-core/src/it/java/org/apache/phoenix/mapreduce/CsvBulkLoadToolIT.java PRE-CREATION 
>   phoenix-core/src/main/java/org/apache/phoenix/map/reduce/CSVBulkLoader.java 0fd74e1 
>   phoenix-core/src/main/java/org/apache/phoenix/map/reduce/MapReduceJob.java 9dc8032 
>   phoenix-core/src/main/java/org/apache/phoenix/map/reduce/util/ConfigReader.java 1c94d6f 
>   phoenix-core/src/main/java/org/apache/phoenix/mapreduce/CsvBulkImportUtil.java PRE-CREATION 
>   phoenix-core/src/main/java/org/apache/phoenix/mapreduce/CsvBulkLoadTool.java PRE-CREATION 
>   phoenix-core/src/main/java/org/apache/phoenix/mapreduce/CsvToKeyValueMapper.java PRE-CREATION 
>   phoenix-core/src/main/java/org/apache/phoenix/mapreduce/ImportPreUpsertKeyValueProcessor.java PRE-CREATION 
>   phoenix-core/src/main/java/org/apache/phoenix/schema/PDataType.java 34166d3 
>   phoenix-core/src/main/java/org/apache/phoenix/util/CSVCommonsLoader.java 3e86477 
>   phoenix-core/src/main/java/org/apache/phoenix/util/ColumnInfo.java 37238c8 
>   phoenix-core/src/test/java/org/apache/phoenix/mapreduce/CsvBulkImportUtilTest.java PRE-CREATION 
>   phoenix-core/src/test/java/org/apache/phoenix/mapreduce/CsvBulkLoadToolTest.java PRE-CREATION 
>   phoenix-core/src/test/java/org/apache/phoenix/mapreduce/CsvToKeyValueMapperTest.java PRE-CREATION 
>   phoenix-core/src/test/java/org/apache/phoenix/util/ColumnInfoTest.java PRE-CREATION 
> 
> Diff: https://reviews.apache.org/r/19257/diff/
> 
> 
> Testing
> -------
> 
> 
> Thanks,
> 
> Gabriel Reid
> 
>

Re: Review Request 19257: PHOENIX-129 - Improve MapReduce import

Posted by Gabriel Reid <gr...@apache.org>.


> On March 17, 2014, 12:50 a.m., Prashant Kommireddi wrote:
> > phoenix-core/src/it/java/org/apache/phoenix/mapreduce/CsvBulkLoadToolIT.java, line 76
> > <https://reviews.apache.org/r/19257/diff/1/?file=520485#file520485line76>
> >
> >     You might want to do this in a try-catch and do closing of stream/writer in a finally.
> >     
> >     Also, creating the tmp path would be better if its platform agnostic. See http://docs.oracle.com/javase/7/docs/api/java/io/File.html#createTempFile(java.lang.String, java.lang.String)

This is a file being created on HDFS (not local FS), so couldn't platform-agnostic separators cause problems here?

Closing it in a finally sounds like a good plan -- that being said, again this is on a transient HDFS cluster within the test, so there is no garbage that is going to be left behind after the test stops no matter what. 


> On March 17, 2014, 12:50 a.m., Prashant Kommireddi wrote:
> > phoenix-core/src/main/java/org/apache/phoenix/mapreduce/CsvBulkLoadTool.java, line 92
> > <https://reviews.apache.org/r/19257/diff/1/?file=520490#file520490line92>
> >
> >     Could this be a private method?

No, it's used by the unit tests, so it needs to be package private. It should be annotated with @VisibleForTesting though.


> On March 17, 2014, 12:50 a.m., Prashant Kommireddi wrote:
> > phoenix-core/src/main/java/org/apache/phoenix/mapreduce/CsvBulkLoadTool.java, line 182
> > <https://reviews.apache.org/r/19257/diff/1/?file=520490#file520490line182>
> >
> >     Probably using "java.io.tmpdir" is better.

This is a path on HDFS, so java.io.tmpdir will point to something else. The explanation on the approach for the creating of this tempdir is in a comment above.


> On March 17, 2014, 12:50 a.m., Prashant Kommireddi wrote:
> > phoenix-core/src/main/java/org/apache/phoenix/mapreduce/CsvBulkLoadTool.java, line 231
> > <https://reviews.apache.org/r/19257/diff/1/?file=520490#file520490line231>
> >
> >     private?

Again, used by unit tests, but should be annotated with @VisibleForTesting.


> On March 17, 2014, 12:50 a.m., Prashant Kommireddi wrote:
> > phoenix-core/src/main/java/org/apache/phoenix/mapreduce/CsvBulkLoadTool.java, line 248
> > <https://reviews.apache.org/r/19257/diff/1/?file=520490#file520490line248>
> >
> >     private method?

Yep, this one really should be private :-)


> On March 17, 2014, 12:50 a.m., Prashant Kommireddi wrote:
> > phoenix-core/src/main/java/org/apache/phoenix/mapreduce/CsvBulkLoadTool.java, line 339
> > <https://reviews.apache.org/r/19257/diff/1/?file=520490#file520490line339>
> >
> >     Should this be done in a finally and the exception thrown back anyway?

Yes, nice catch on that.


> On March 17, 2014, 12:50 a.m., Prashant Kommireddi wrote:
> > phoenix-core/src/main/java/org/apache/phoenix/mapreduce/CsvToKeyValueMapper.java, line 176
> > <https://reviews.apache.org/r/19257/diff/1/?file=520491#file520491line176>
> >
> >     Do you think we should catch a specific exception here?

Yes, good point -- I think I had a specific reason for doing it this way, but I don't remember what the exact reason was and unfortunately I didn't document it here. Should be double-checked.


- Gabriel


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/19257/#review37328
-----------------------------------------------------------


On March 15, 2014, 9:16 p.m., Gabriel Reid wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/19257/
> -----------------------------------------------------------
> 
> (Updated March 15, 2014, 9:16 p.m.)
> 
> 
> Review request for phoenix.
> 
> 
> Repository: phoenix
> 
> 
> Description
> -------
> 
> Rewrite of the Phoenix MapReduce import, to follow more standard MR tool development patterns and use standard Phoenix CSV handling functionality.
> 
> 
> Diffs
> -----
> 
>   bin/csv-bulk-loader.py 385ef41 
>   bin/readme.txt fa23eeb 
>   phoenix-assembly/pom.xml 9a69cab 
>   phoenix-assembly/src/build/all.xml 9c1bc41 
>   phoenix-assembly/src/build/mapreduce.xml PRE-CREATION 
>   phoenix-core/src/it/java/org/apache/phoenix/mapreduce/CsvBulkLoadToolIT.java PRE-CREATION 
>   phoenix-core/src/main/java/org/apache/phoenix/map/reduce/CSVBulkLoader.java 0fd74e1 
>   phoenix-core/src/main/java/org/apache/phoenix/map/reduce/MapReduceJob.java 9dc8032 
>   phoenix-core/src/main/java/org/apache/phoenix/map/reduce/util/ConfigReader.java 1c94d6f 
>   phoenix-core/src/main/java/org/apache/phoenix/mapreduce/CsvBulkImportUtil.java PRE-CREATION 
>   phoenix-core/src/main/java/org/apache/phoenix/mapreduce/CsvBulkLoadTool.java PRE-CREATION 
>   phoenix-core/src/main/java/org/apache/phoenix/mapreduce/CsvToKeyValueMapper.java PRE-CREATION 
>   phoenix-core/src/main/java/org/apache/phoenix/mapreduce/ImportPreUpsertKeyValueProcessor.java PRE-CREATION 
>   phoenix-core/src/main/java/org/apache/phoenix/schema/PDataType.java 34166d3 
>   phoenix-core/src/main/java/org/apache/phoenix/util/CSVCommonsLoader.java 3e86477 
>   phoenix-core/src/main/java/org/apache/phoenix/util/ColumnInfo.java 37238c8 
>   phoenix-core/src/test/java/org/apache/phoenix/mapreduce/CsvBulkImportUtilTest.java PRE-CREATION 
>   phoenix-core/src/test/java/org/apache/phoenix/mapreduce/CsvBulkLoadToolTest.java PRE-CREATION 
>   phoenix-core/src/test/java/org/apache/phoenix/mapreduce/CsvToKeyValueMapperTest.java PRE-CREATION 
>   phoenix-core/src/test/java/org/apache/phoenix/util/ColumnInfoTest.java PRE-CREATION 
> 
> Diff: https://reviews.apache.org/r/19257/diff/
> 
> 
> Testing
> -------
> 
> 
> Thanks,
> 
> Gabriel Reid
> 
>

Re: Review Request 19257: PHOENIX-129 - Improve MapReduce import

Posted by Prashant Kommireddi <pr...@gmail.com>.


> On March 17, 2014, 12:50 a.m., Prashant Kommireddi wrote:
> > bin/readme.txt, line 63
> > <https://reviews.apache.org/r/19257/diff/1/?file=520481#file520481line63>
> >
> >     Is this always a temporary position?
> 
> Gabriel Reid wrote:
>     Yes, it's (currently) always temporary. The HFiles that are written as part of the import process (but thrown away after they've been added to HBase) are written to a random directory under /tmp (in HDFS) by default, but depending on the setup of HDFS and the existence of directory permissions on it, it may be necessary to specify a different temp path to use (e.g. something under the user directory). Ideally this parameter wouldn't need to exist at all, but I think it is needed.
>     
>     James also mentioned a potential use case of writing HFiles for a table that doesn't exist -- I can see this being useful if you've got a separate MR cluster from your HBase cluster, and want to build up the HFiles quickly without putting a load on your HBase cluster. That's not a use case that's currently supported, but could become useful in the future.

Yep I remember having read about the latter in which case the output directory might not be considered temp. Since that's not supported currently, not something to worry about now.


> On March 17, 2014, 12:50 a.m., Prashant Kommireddi wrote:
> > phoenix-core/src/it/java/org/apache/phoenix/mapreduce/CsvBulkLoadToolIT.java, line 42
> > <https://reviews.apache.org/r/19257/diff/1/?file=520485#file520485line42>
> >
> >     Should this be named *Test, say CsvBulkLoadToolTest ?
> 
> Gabriel Reid wrote:
>     No, as of PHOENIX-130 all slow-running integration tests are run by failsafe, under a separate source root. The "*IT.java" naming is used to help failsafe to distinguish them from fast-running unit tests.

Good to know, I wasn't aware. I'm learning a bit about Phoenix :)


> On March 17, 2014, 12:50 a.m., Prashant Kommireddi wrote:
> > phoenix-core/src/it/java/org/apache/phoenix/mapreduce/CsvBulkLoadToolIT.java, line 76
> > <https://reviews.apache.org/r/19257/diff/1/?file=520485#file520485line76>
> >
> >     You might want to do this in a try-catch and do closing of stream/writer in a finally.
> >     
> >     Also, creating the tmp path would be better if its platform agnostic. See http://docs.oracle.com/javase/7/docs/api/java/io/File.html#createTempFile(java.lang.String, java.lang.String)
> 
> Gabriel Reid wrote:
>     This is a file being created on HDFS (not local FS), so couldn't platform-agnostic separators cause problems here?
>     
>     Closing it in a finally sounds like a good plan -- that being said, again this is on a transient HDFS cluster within the test, so there is no garbage that is going to be left behind after the test stops no matter what.

I'm sorry, you are right this is writing to HDFS. Ignore my comment regarding using File API.

Regarding closing streams, I'm less worried about the garbage and more about closing streams cleanly. Would the streams be released/closed at the end of tests regardless?


> On March 17, 2014, 12:50 a.m., Prashant Kommireddi wrote:
> > phoenix-core/src/main/java/org/apache/phoenix/mapreduce/CsvBulkLoadTool.java, line 92
> > <https://reviews.apache.org/r/19257/diff/1/?file=520490#file520490line92>
> >
> >     Could this be a private method?
> 
> Gabriel Reid wrote:
>     No, it's used by the unit tests, so it needs to be package private. It should be annotated with @VisibleForTesting though.

Sounds good


> On March 17, 2014, 12:50 a.m., Prashant Kommireddi wrote:
> > phoenix-core/src/main/java/org/apache/phoenix/mapreduce/CsvBulkLoadTool.java, line 182
> > <https://reviews.apache.org/r/19257/diff/1/?file=520490#file520490line182>
> >
> >     Probably using "java.io.tmpdir" is better.
> 
> Gabriel Reid wrote:
>     This is a path on HDFS, so java.io.tmpdir will point to something else. The explanation on the approach for the creating of this tempdir is in a comment above.

Yep, my bad.


- Prashant


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/19257/#review37328
-----------------------------------------------------------


On March 15, 2014, 9:16 p.m., Gabriel Reid wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/19257/
> -----------------------------------------------------------
> 
> (Updated March 15, 2014, 9:16 p.m.)
> 
> 
> Review request for phoenix.
> 
> 
> Repository: phoenix
> 
> 
> Description
> -------
> 
> Rewrite of the Phoenix MapReduce import, to follow more standard MR tool development patterns and use standard Phoenix CSV handling functionality.
> 
> 
> Diffs
> -----
> 
>   bin/csv-bulk-loader.py 385ef41 
>   bin/readme.txt fa23eeb 
>   phoenix-assembly/pom.xml 9a69cab 
>   phoenix-assembly/src/build/all.xml 9c1bc41 
>   phoenix-assembly/src/build/mapreduce.xml PRE-CREATION 
>   phoenix-core/src/it/java/org/apache/phoenix/mapreduce/CsvBulkLoadToolIT.java PRE-CREATION 
>   phoenix-core/src/main/java/org/apache/phoenix/map/reduce/CSVBulkLoader.java 0fd74e1 
>   phoenix-core/src/main/java/org/apache/phoenix/map/reduce/MapReduceJob.java 9dc8032 
>   phoenix-core/src/main/java/org/apache/phoenix/map/reduce/util/ConfigReader.java 1c94d6f 
>   phoenix-core/src/main/java/org/apache/phoenix/mapreduce/CsvBulkImportUtil.java PRE-CREATION 
>   phoenix-core/src/main/java/org/apache/phoenix/mapreduce/CsvBulkLoadTool.java PRE-CREATION 
>   phoenix-core/src/main/java/org/apache/phoenix/mapreduce/CsvToKeyValueMapper.java PRE-CREATION 
>   phoenix-core/src/main/java/org/apache/phoenix/mapreduce/ImportPreUpsertKeyValueProcessor.java PRE-CREATION 
>   phoenix-core/src/main/java/org/apache/phoenix/schema/PDataType.java 34166d3 
>   phoenix-core/src/main/java/org/apache/phoenix/util/CSVCommonsLoader.java 3e86477 
>   phoenix-core/src/main/java/org/apache/phoenix/util/ColumnInfo.java 37238c8 
>   phoenix-core/src/test/java/org/apache/phoenix/mapreduce/CsvBulkImportUtilTest.java PRE-CREATION 
>   phoenix-core/src/test/java/org/apache/phoenix/mapreduce/CsvBulkLoadToolTest.java PRE-CREATION 
>   phoenix-core/src/test/java/org/apache/phoenix/mapreduce/CsvToKeyValueMapperTest.java PRE-CREATION 
>   phoenix-core/src/test/java/org/apache/phoenix/util/ColumnInfoTest.java PRE-CREATION 
> 
> Diff: https://reviews.apache.org/r/19257/diff/
> 
> 
> Testing
> -------
> 
> 
> Thanks,
> 
> Gabriel Reid
> 
>

Re: Review Request 19257: PHOENIX-129 - Improve MapReduce import

Posted by Gabriel Reid <gr...@apache.org>.


> On March 17, 2014, 12:50 a.m., Prashant Kommireddi wrote:
> > phoenix-core/src/it/java/org/apache/phoenix/mapreduce/CsvBulkLoadToolIT.java, line 76
> > <https://reviews.apache.org/r/19257/diff/1/?file=520485#file520485line76>
> >
> >     You might want to do this in a try-catch and do closing of stream/writer in a finally.
> >     
> >     Also, creating the tmp path would be better if its platform agnostic. See http://docs.oracle.com/javase/7/docs/api/java/io/File.html#createTempFile(java.lang.String, java.lang.String)
> 
> Gabriel Reid wrote:
>     This is a file being created on HDFS (not local FS), so couldn't platform-agnostic separators cause problems here?
>     
>     Closing it in a finally sounds like a good plan -- that being said, again this is on a transient HDFS cluster within the test, so there is no garbage that is going to be left behind after the test stops no matter what.
> 
> Prashant Kommireddi wrote:
>     I'm sorry, you are right this is writing to HDFS. Ignore my comment regarding using File API.
>     
>     Regarding closing streams, I'm less worried about the garbage and more about closing streams cleanly. Would the streams be released/closed at the end of tests regardless?

The open streams are to a HDFS/HBase cluster that is running only for the duration of the test, so I believe they'll be closed regardless at the end of the test (when the cluster gets destroyed). I'm often a bit torn about situations like this -- on the one hand it feels wrong to not ensure that things like this are cleanly closed, but on the other hand it can make the test a lot less readable if all the closing is done *really* correctly.


- Gabriel


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/19257/#review37328
-----------------------------------------------------------


On March 15, 2014, 9:16 p.m., Gabriel Reid wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/19257/
> -----------------------------------------------------------
> 
> (Updated March 15, 2014, 9:16 p.m.)
> 
> 
> Review request for phoenix.
> 
> 
> Repository: phoenix
> 
> 
> Description
> -------
> 
> Rewrite of the Phoenix MapReduce import, to follow more standard MR tool development patterns and use standard Phoenix CSV handling functionality.
> 
> 
> Diffs
> -----
> 
>   bin/csv-bulk-loader.py 385ef41 
>   bin/readme.txt fa23eeb 
>   phoenix-assembly/pom.xml 9a69cab 
>   phoenix-assembly/src/build/all.xml 9c1bc41 
>   phoenix-assembly/src/build/mapreduce.xml PRE-CREATION 
>   phoenix-core/src/it/java/org/apache/phoenix/mapreduce/CsvBulkLoadToolIT.java PRE-CREATION 
>   phoenix-core/src/main/java/org/apache/phoenix/map/reduce/CSVBulkLoader.java 0fd74e1 
>   phoenix-core/src/main/java/org/apache/phoenix/map/reduce/MapReduceJob.java 9dc8032 
>   phoenix-core/src/main/java/org/apache/phoenix/map/reduce/util/ConfigReader.java 1c94d6f 
>   phoenix-core/src/main/java/org/apache/phoenix/mapreduce/CsvBulkImportUtil.java PRE-CREATION 
>   phoenix-core/src/main/java/org/apache/phoenix/mapreduce/CsvBulkLoadTool.java PRE-CREATION 
>   phoenix-core/src/main/java/org/apache/phoenix/mapreduce/CsvToKeyValueMapper.java PRE-CREATION 
>   phoenix-core/src/main/java/org/apache/phoenix/mapreduce/ImportPreUpsertKeyValueProcessor.java PRE-CREATION 
>   phoenix-core/src/main/java/org/apache/phoenix/schema/PDataType.java 34166d3 
>   phoenix-core/src/main/java/org/apache/phoenix/util/CSVCommonsLoader.java 3e86477 
>   phoenix-core/src/main/java/org/apache/phoenix/util/ColumnInfo.java 37238c8 
>   phoenix-core/src/test/java/org/apache/phoenix/mapreduce/CsvBulkImportUtilTest.java PRE-CREATION 
>   phoenix-core/src/test/java/org/apache/phoenix/mapreduce/CsvBulkLoadToolTest.java PRE-CREATION 
>   phoenix-core/src/test/java/org/apache/phoenix/mapreduce/CsvToKeyValueMapperTest.java PRE-CREATION 
>   phoenix-core/src/test/java/org/apache/phoenix/util/ColumnInfoTest.java PRE-CREATION 
> 
> Diff: https://reviews.apache.org/r/19257/diff/
> 
> 
> Testing
> -------
> 
> 
> Thanks,
> 
> Gabriel Reid
> 
>

Re: Review Request 19257: PHOENIX-129 - Improve MapReduce import

Posted by Gabriel Reid <gr...@apache.org>.


> On March 17, 2014, 12:50 a.m., Prashant Kommireddi wrote:
> > bin/readme.txt, line 63
> > <https://reviews.apache.org/r/19257/diff/1/?file=520481#file520481line63>
> >
> >     Is this always a temporary position?

Yes, it's (currently) always temporary. The HFiles that are written as part of the import process (but thrown away after they've been added to HBase) are written to a random directory under /tmp (in HDFS) by default, but depending on the setup of HDFS and the existence of directory permissions on it, it may be necessary to specify a different temp path to use (e.g. something under the user directory). Ideally this parameter wouldn't need to exist at all, but I think it is needed.

James also mentioned a potential use case of writing HFiles for a table that doesn't exist -- I can see this being useful if you've got a separate MR cluster from your HBase cluster, and want to build up the HFiles quickly without putting a load on your HBase cluster. That's not a use case that's currently supported, but could become useful in the future.


> On March 17, 2014, 12:50 a.m., Prashant Kommireddi wrote:
> > phoenix-core/src/it/java/org/apache/phoenix/mapreduce/CsvBulkLoadToolIT.java, line 42
> > <https://reviews.apache.org/r/19257/diff/1/?file=520485#file520485line42>
> >
> >     Should this be named *Test, say CsvBulkLoadToolTest ?

No, as of PHOENIX-130 all slow-running integration tests are run by failsafe, under a separate source root. The "*IT.java" naming is used to help failsafe to distinguish them from fast-running unit tests.


- Gabriel


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/19257/#review37328
-----------------------------------------------------------


On March 15, 2014, 9:16 p.m., Gabriel Reid wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/19257/
> -----------------------------------------------------------
> 
> (Updated March 15, 2014, 9:16 p.m.)
> 
> 
> Review request for phoenix.
> 
> 
> Repository: phoenix
> 
> 
> Description
> -------
> 
> Rewrite of the Phoenix MapReduce import, to follow more standard MR tool development patterns and use standard Phoenix CSV handling functionality.
> 
> 
> Diffs
> -----
> 
>   bin/csv-bulk-loader.py 385ef41 
>   bin/readme.txt fa23eeb 
>   phoenix-assembly/pom.xml 9a69cab 
>   phoenix-assembly/src/build/all.xml 9c1bc41 
>   phoenix-assembly/src/build/mapreduce.xml PRE-CREATION 
>   phoenix-core/src/it/java/org/apache/phoenix/mapreduce/CsvBulkLoadToolIT.java PRE-CREATION 
>   phoenix-core/src/main/java/org/apache/phoenix/map/reduce/CSVBulkLoader.java 0fd74e1 
>   phoenix-core/src/main/java/org/apache/phoenix/map/reduce/MapReduceJob.java 9dc8032 
>   phoenix-core/src/main/java/org/apache/phoenix/map/reduce/util/ConfigReader.java 1c94d6f 
>   phoenix-core/src/main/java/org/apache/phoenix/mapreduce/CsvBulkImportUtil.java PRE-CREATION 
>   phoenix-core/src/main/java/org/apache/phoenix/mapreduce/CsvBulkLoadTool.java PRE-CREATION 
>   phoenix-core/src/main/java/org/apache/phoenix/mapreduce/CsvToKeyValueMapper.java PRE-CREATION 
>   phoenix-core/src/main/java/org/apache/phoenix/mapreduce/ImportPreUpsertKeyValueProcessor.java PRE-CREATION 
>   phoenix-core/src/main/java/org/apache/phoenix/schema/PDataType.java 34166d3 
>   phoenix-core/src/main/java/org/apache/phoenix/util/CSVCommonsLoader.java 3e86477 
>   phoenix-core/src/main/java/org/apache/phoenix/util/ColumnInfo.java 37238c8 
>   phoenix-core/src/test/java/org/apache/phoenix/mapreduce/CsvBulkImportUtilTest.java PRE-CREATION 
>   phoenix-core/src/test/java/org/apache/phoenix/mapreduce/CsvBulkLoadToolTest.java PRE-CREATION 
>   phoenix-core/src/test/java/org/apache/phoenix/mapreduce/CsvToKeyValueMapperTest.java PRE-CREATION 
>   phoenix-core/src/test/java/org/apache/phoenix/util/ColumnInfoTest.java PRE-CREATION 
> 
> Diff: https://reviews.apache.org/r/19257/diff/
> 
> 
> Testing
> -------
> 
> 
> Thanks,
> 
> Gabriel Reid
> 
>

Re: Review Request 19257: PHOENIX-129 - Improve MapReduce import

Posted by Prashant Kommireddi <pr...@gmail.com>.

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/19257/#review37328
-----------------------------------------------------------



bin/readme.txt
<https://reviews.apache.org/r/19257/#comment68828>

    Is this always a temporary position?



phoenix-core/src/it/java/org/apache/phoenix/mapreduce/CsvBulkLoadToolIT.java
<https://reviews.apache.org/r/19257/#comment68829>

    Should this be named *Test, say CsvBulkLoadToolTest ?



phoenix-core/src/it/java/org/apache/phoenix/mapreduce/CsvBulkLoadToolIT.java
<https://reviews.apache.org/r/19257/#comment68830>

    You might want to do this in a try-catch and do closing of stream/writer in a finally.
    
    Also, creating the tmp path would be better if its platform agnostic. See http://docs.oracle.com/javase/7/docs/api/java/io/File.html#createTempFile(java.lang.String, java.lang.String)



phoenix-core/src/it/java/org/apache/phoenix/mapreduce/CsvBulkLoadToolIT.java
<https://reviews.apache.org/r/19257/#comment68831>

    Same here, this should probably be done in a finally.



phoenix-core/src/it/java/org/apache/phoenix/mapreduce/CsvBulkLoadToolIT.java
<https://reviews.apache.org/r/19257/#comment68832>

    Same as above



phoenix-core/src/main/java/org/apache/phoenix/mapreduce/CsvBulkImportUtil.java
<https://reviews.apache.org/r/19257/#comment68833>

    You could have a private constructor if it's only static methods on this.



phoenix-core/src/main/java/org/apache/phoenix/mapreduce/CsvBulkLoadTool.java
<https://reviews.apache.org/r/19257/#comment68834>

    Could this be a private method?



phoenix-core/src/main/java/org/apache/phoenix/mapreduce/CsvBulkLoadTool.java
<https://reviews.apache.org/r/19257/#comment68835>

    Probably using "java.io.tmpdir" is better.



phoenix-core/src/main/java/org/apache/phoenix/mapreduce/CsvBulkLoadTool.java
<https://reviews.apache.org/r/19257/#comment68836>

    private?



phoenix-core/src/main/java/org/apache/phoenix/mapreduce/CsvBulkLoadTool.java
<https://reviews.apache.org/r/19257/#comment68837>

    private method?



phoenix-core/src/main/java/org/apache/phoenix/mapreduce/CsvBulkLoadTool.java
<https://reviews.apache.org/r/19257/#comment68838>

    Should this be done in a finally and the exception thrown back anyway?



phoenix-core/src/main/java/org/apache/phoenix/mapreduce/CsvToKeyValueMapper.java
<https://reviews.apache.org/r/19257/#comment68839>

    Do you think we should catch a specific exception here?


- Prashant Kommireddi


On March 15, 2014, 9:16 p.m., Gabriel Reid wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/19257/
> -----------------------------------------------------------
> 
> (Updated March 15, 2014, 9:16 p.m.)
> 
> 
> Review request for phoenix.
> 
> 
> Repository: phoenix
> 
> 
> Description
> -------
> 
> Rewrite of the Phoenix MapReduce import, to follow more standard MR tool development patterns and use standard Phoenix CSV handling functionality.
> 
> 
> Diffs
> -----
> 
>   bin/csv-bulk-loader.py 385ef41 
>   bin/readme.txt fa23eeb 
>   phoenix-assembly/pom.xml 9a69cab 
>   phoenix-assembly/src/build/all.xml 9c1bc41 
>   phoenix-assembly/src/build/mapreduce.xml PRE-CREATION 
>   phoenix-core/src/it/java/org/apache/phoenix/mapreduce/CsvBulkLoadToolIT.java PRE-CREATION 
>   phoenix-core/src/main/java/org/apache/phoenix/map/reduce/CSVBulkLoader.java 0fd74e1 
>   phoenix-core/src/main/java/org/apache/phoenix/map/reduce/MapReduceJob.java 9dc8032 
>   phoenix-core/src/main/java/org/apache/phoenix/map/reduce/util/ConfigReader.java 1c94d6f 
>   phoenix-core/src/main/java/org/apache/phoenix/mapreduce/CsvBulkImportUtil.java PRE-CREATION 
>   phoenix-core/src/main/java/org/apache/phoenix/mapreduce/CsvBulkLoadTool.java PRE-CREATION 
>   phoenix-core/src/main/java/org/apache/phoenix/mapreduce/CsvToKeyValueMapper.java PRE-CREATION 
>   phoenix-core/src/main/java/org/apache/phoenix/mapreduce/ImportPreUpsertKeyValueProcessor.java PRE-CREATION 
>   phoenix-core/src/main/java/org/apache/phoenix/schema/PDataType.java 34166d3 
>   phoenix-core/src/main/java/org/apache/phoenix/util/CSVCommonsLoader.java 3e86477 
>   phoenix-core/src/main/java/org/apache/phoenix/util/ColumnInfo.java 37238c8 
>   phoenix-core/src/test/java/org/apache/phoenix/mapreduce/CsvBulkImportUtilTest.java PRE-CREATION 
>   phoenix-core/src/test/java/org/apache/phoenix/mapreduce/CsvBulkLoadToolTest.java PRE-CREATION 
>   phoenix-core/src/test/java/org/apache/phoenix/mapreduce/CsvToKeyValueMapperTest.java PRE-CREATION 
>   phoenix-core/src/test/java/org/apache/phoenix/util/ColumnInfoTest.java PRE-CREATION 
> 
> Diff: https://reviews.apache.org/r/19257/diff/
> 
> 
> Testing
> -------
> 
> 
> Thanks,
> 
> Gabriel Reid
> 
>