Posted to user@phoenix.apache.org by Dhruv Gohil <yo...@gmail.com> on 2015/10/29 21:26:43 UTC

Re: replace CsvToKeyValueMapper with my implementation

+1

FYI:    We did this (Phoenix 4.2.2) by copy-pasting the whole 
"CsvBulkLoadTool" and changing the pieces we wanted: a custom parser, 
getting back job counters to drive downstream decisions, etc.

+1 for pluggability, but we don't know how stable the interface would be 
(should we even publish it?).
      A wild idea: instead of inventing a proper interface, we could 
refactor the logic out of org.apache.phoenix.mapreduce.Csv* (3 classes) 
to make the current implementation independent of "CSV" and "MapReduce".
    That way CsvBulkLoadTool would be a lightweight default reference, 
and people could just extend/copy it to customize most of the behaviour.
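As a rough sketch of that idea (the interface and class names below are 
hypothetical, not existing Phoenix code), the format-specific piece could 
be boiled down to something like:

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical interface: the bulk load tool would depend only on
// RecordParser, so a JSON- or Kafka-backed implementation could be
// dropped in without touching the MapReduce plumbing.
interface RecordParser {
    // Turns one raw input record into an ordered list of column values.
    List<String> parse(String record);
}

// A minimal CSV-flavoured implementation as the default reference.
class SimpleCsvParser implements RecordParser {
    private final String delimiter;

    SimpleCsvParser(String delimiter) {
        this.delimiter = delimiter;
    }

    @Override
    public List<String> parse(String record) {
        // Pattern.quote treats the delimiter literally; the -1 limit
        // keeps trailing empty fields, matching typical CSV expectations.
        return Arrays.asList(
                record.split(java.util.regex.Pattern.quote(delimiter), -1));
    }
}
```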

P.S.: We are going to take a shot at picking up records directly from 
Kafka instead of a CSV file soon.

On Thursday 29 October 2015 03:38 PM, Bulvik, Noam wrote:
>
> This is exactly what I need, i.e. to be able to change the content of 
> the row rather than use a different input format.
>
> The use case is when you need to load a large amount of data from files 
> and each row needs to be handled before it is processed by the CSV 
> parser. Examples include changing the date format, fixing encoding, 
> escaping delimiters and more. Of course this could be done in a 
> separate map-reduce job, but since we are already processing each row, 
> it would be nice if we could do it there.
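The kind of per-row fix-up described above could be sketched as a pure 
function applied to each line before CSV parsing. The input layout (a 
dd/MM/yyyy date as the first comma-separated field) and the class name 
are assumptions for illustration only:

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;

// Sketch of a pre-parse row fix-up: normalize a leading dd/MM/yyyy date
// field to ISO yyyy-MM-dd before the line reaches the CSV parser.
class RowPreprocessor {
    private static final DateTimeFormatter IN =
            DateTimeFormatter.ofPattern("dd/MM/yyyy");
    private static final DateTimeFormatter OUT =
            DateTimeFormatter.ISO_LOCAL_DATE;

    static String fixDateField(String line) {
        int comma = line.indexOf(',');
        if (comma < 0) {
            return line; // no delimiter: leave the row untouched
        }
        String first = line.substring(0, comma);
        String rest = line.substring(comma);
        return LocalDate.parse(first, IN).format(OUT) + rest;
    }
}
```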
>
> *From:* James Taylor [mailto:jamestaylor@apache.org]
> *Sent:* Thursday, October 29, 2015 7:33 PM
> *To:* user <us...@phoenix.apache.org>
> *Subject:* Re: replace CsvToKeyValueMapper with my implementation
>
> I seem to remember you starting down that path, Gabriel - a kind of 
> pluggable transformation for each row. It wasn't pluggable on the 
> input format, but that's a nice idea too, Ravi. I'm not sure if this 
> is what Noam needs or if it's something else.
>
> Probably good to discuss a bit more at the use case level to 
> understand the specifics a bit more.
>
> On Thu, Oct 29, 2015 at 9:17 AM, Ravi Kiran <maghamravikiran@gmail.com> wrote:
>
>     It would be great if we could provide an API and let end users
>     provide an implementation of how to parse each record. That way, we
>     could move beyond bulk loading only CSV, and have JSON and other
>     input formats bulk loaded into Phoenix tables.
>
>     I can take that one up. Would this be something the community would
>     like as a feature?
>
>     On Thu, Oct 29, 2015 at 8:10 AM, Gabriel Reid <gabriel.reid@gmail.com> wrote:
>
>         Hi Noam,
>
>         That specific piece of code in CsvBulkLoadTool that you
>         referred to
>         allows packaging the CsvBulkLoadTool within a different job
>         jar file,
>         but won't allow setting a different mapper class. The actual
>         setting
>         of the mapper class is done further down in the submitJob method,
>         specifically the following piece:
>
>          job.setMapperClass(CsvToKeyValueMapper.class);
>
>         There isn't currently a way to load a custom mapper in the
>         CsvBulkLoadTool, so the only (current) option is to create a
>         fully new
>         custom implementation of the bulk load tool (probably copying or
>         reusing most of the existing tool). However, I can certainly
>         imagine
>         this being a useful feature to have in some situations.
>
>         Could you log this request in jira? It would also be really
>         good to
>         have some more detail on your specific use case. And even
>         better is a
>         patch that implements it :-)
>
>         - Gabriel
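For reference, the jar override Gabriel mentions is driven from the 
command line when the job is submitted. A hypothetical invocation of a 
copied tool might look like the following (the jar name, class name, 
table, and paths are all assumptions for illustration):

```shell
# Hypothetical: run a copied bulk load tool from our own job jar.
# Because the jar is already set via -D, job.getJar() is non-null and
# the tool skips job.setJarByClass(CsvToKeyValueMapper.class).
hadoop jar my-bulkload.jar com.example.MyBulkLoadTool \
  -D mapreduce.job.jar=my-bulkload.jar \
  --table MY_TABLE \
  --input /data/rows.csv
```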
>
>
>
>         On Thu, Oct 29, 2015 at 3:22 PM, Bulvik, Noam <Noam.Bulvik@teoco.com> wrote:
>         > Hi,
>         >
>         >
>         >
>         > We have private logic that must be executed when parsing
>         > each line before it is uploaded to Phoenix. I saw the
>         > following in the code of CsvBulkLoadTool:
>         >
>         > // Allow overriding the job jar setting by using a -D system
>         > // property at startup
>         > if (job.getJar() == null) {
>         >     job.setJarByClass(CsvToKeyValueMapper.class);
>         > }
>         >
>         >
>         >
>         > Assuming I have an implementation, MyKeyValueMapper, how can
>         > I make sure it is loaded instead of the standard one?
>         >
>         >
>         >
>         > Also, in the CsvToKeyValueMapper class there are some
>         > private members, such as:
>         >
>         > ·         private PhoenixConnection conn;
>         >
>         > ·         private byte[] tableName;
>         >
>         >
>         >
>         > Could you add an option to access these members, or make
>         > them protected, so that we can use them in a class that
>         > extends CsvToKeyValueMapper instead of duplicating them and
>         > the code that initializes them?
>         >
>         >
>         >
>         > We are using Phoenix 4.5.2 over CDH.
>         >
>         >
>         >
>         > thanks
>         >
>         > Noam
>         >
>         >
>         >
>         > Noam Bulvik
>         >
>         > R&D Manager
>         >
>         >
>         >
>         > TEOCO CORPORATION
>         >
>         > c: +972 54 5507984
>         >
>         > p: +972 3 9269145
>         >
>         > Noam.Bulvik@teoco.com
>         >
>         > www.teoco.com
>         >
>         >
>         >
>         >
>         > ________________________________
>         >
>         > PRIVILEGED AND CONFIDENTIAL
>         > PLEASE NOTE: The information contained in this message is
>         privileged and
>         > confidential, and is intended only for the use of the
>         individual to whom it
>         > is addressed and others who have been specifically
>         authorized to receive it.
>         > If you are not the intended recipient, you are hereby
>         notified that any
>         > dissemination, distribution or copying of this communication
>         is strictly
>         > prohibited. If you have received this communication in
>         error, or if any
>         > problems occur with transmission, please contact sender.
>         Thank you.
>
>