Posted to user@sqoop.apache.org by Kathleen Ting <ka...@apache.org> on 2013/04/23 21:21:06 UTC

Re: sqoop job

Hi Jay, this use-case seems to be beyond the scope of Sqoop, which is meant
to just transfer data between a structured datastore and Hadoop. Including
user@sqoop.apache.org to solicit more opinions.

Regards, Kate

On Mon, Apr 22, 2013 at 11:04 PM, jaikumar krishna <ja...@gmail.com> wrote:

> Thanks Kate,
>
> Here is my use case.
> I have two input tables, Table1 and Table2. Table1 (the master) holds
> 25 lakh (2.5 million) records with the fields company name, address, city,
> state, zip, phone number, fax, mail ID, and company website URL.
>
> Table2 holds 5 lakh (500,000) records with the same fields as Table1.
> I want to check whether each Table2 record matches a record in Table1,
> to verify whether it is correct.
>
> Before matching, I have to apply normalizations like the ones below.
>
> Company name                 =>  Normalized company name
> Century Tool & Gage          =>  Century Tool and Gage
> News-Gazette Printing Co     =>  News Gazette Printing
> Punch Networks Inc           =>  Punch Networks
> Omni Print Inc               =>  Omni Print
>
> For the Address_1 column:
>
> Address_1                    =>  Address_1_Normalized
> 15 Sproat St                 =>  15 Sproat Street
> 1 Preble Rd                  =>  1 Preble Road
> 90 Everett Ave               =>  90 Everett Avenue
>
> Please see the attached Excel sheet for the normalization of the remaining
> fields. (Both tables are normalized before verifying.)
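
The normalization rules listed above could be sketched in Python roughly as
follows. This is only an illustration built from the seven examples in the
message: the rule tables COMPANY_SUFFIXES and STREET_ABBREVIATIONS and both
function names are hypothetical, and the remaining rules from the attached
Excel sheet are not covered.

```python
# Hypothetical rule tables derived only from the examples in the message.
# The full rule set lives in the attached Excel sheet, which is not shown here.
COMPANY_SUFFIXES = {"inc", "co"}
STREET_ABBREVIATIONS = {"st": "Street", "rd": "Road", "ave": "Avenue"}


def normalize_company(name: str) -> str:
    """Replace '&' with 'and', turn hyphens into spaces, drop a trailing suffix."""
    text = name.replace("&", " and ").replace("-", " ")
    words = text.split()  # split() also collapses repeated whitespace
    if words and words[-1].rstrip(".").lower() in COMPANY_SUFFIXES:
        words = words[:-1]
    return " ".join(words)


def normalize_address(address: str) -> str:
    """Expand a trailing street-type abbreviation such as 'St' or 'Rd'."""
    words = address.split()
    if words:
        key = words[-1].rstrip(".").lower()
        if key in STREET_ABBREVIATIONS:
            words[-1] = STREET_ABBREVIATIONS[key]
    return " ".join(words)


if __name__ == "__main__":
    print(normalize_company("News-Gazette Printing Co"))
    print(normalize_address("15 Sproat St"))
```

Only the last word of an address is expanded, so a leading "St" that means
"Saint" is left alone. Whether that is the right choice depends on the data.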
>
> I then score the matching fields and apply these conditions to decide
> whether a record is verified:
>
> 1. (company name == 100 and (address == 100 or phone number == 100))
> 2. (company name >= 75 and address >= 75 and city == 100 and state == 100)
>
> If either condition is satisfied, I can mark the record as verified.
>
> In the other case, if the company name and phone number did not match
> Table1, I can add the record as a new entity (meaning it is not in Table1).
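
The two conditions and the new-entity case above could be expressed as a small
Python function. This is a sketch under stated assumptions. The thread does not
say how the 0 to 100 scores are computed, so similarity() uses
difflib.SequenceMatcher as a stand-in. "Did not match" is read as "scored below
NO_MATCH_BELOW", a hypothetical threshold. The "review" outcome is not in the
original message; it is added only for records that fit neither case.

```python
from difflib import SequenceMatcher

# Assumed cutoff for "did not match"; the original message gives no number.
NO_MATCH_BELOW = 75


def similarity(a: str, b: str) -> int:
    """Score two strings from 0 to 100 (stand-in metric, not from the thread)."""
    return round(100 * SequenceMatcher(None, a.lower(), b.lower()).ratio())


def classify(scores: dict) -> str:
    """Apply the two verification conditions, then the new-entity case."""
    rule_1 = scores["company"] == 100 and (
        scores["address"] == 100 or scores["phone"] == 100
    )
    rule_2 = (
        scores["company"] >= 75
        and scores["address"] >= 75
        and scores["city"] == 100
        and scores["state"] == 100
    )
    if rule_1 or rule_2:
        return "verified"
    if scores["company"] < NO_MATCH_BELOW and scores["phone"] < NO_MATCH_BELOW:
        return "new"
    return "review"  # assumption: anything else is left for manual checking
```

Here scores is a dict with the keys "company", "address", "phone", "city", and
"state", each holding the score of one Table2 field against one Table1 record.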
>
> I have attached sample records of Table1 and Table2, along with my current
> output (which includes the scores from my current process; without Hadoop
> it takes more and more time).
>
>
> I hope you understand my use case.
>
> The main problem is: how can I compare each row's 6 fields (company
> name, city, street, state, phone, mail ID) against another table, get a
> score, and finally take the maximum? I am totally frustrated.
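
One common way to avoid comparing every Table2 row with every Table1 row is to
group the master table by a blocking key and score only within the matching
group. The Python sketch below illustrates that idea. It is not something
proposed in this thread: blocking on the state field, summing the six field
scores to pick the maximum, the difflib-based similarity(), and all function
and key names are assumptions.

```python
from collections import defaultdict
from difflib import SequenceMatcher

# The six fields named in the message, used here as hypothetical dict keys.
FIELDS = ("company", "street", "city", "state", "phone", "mail")


def similarity(a: str, b: str) -> int:
    """Score two strings from 0 to 100 (stand-in metric, not from the thread)."""
    return round(100 * SequenceMatcher(None, a.lower(), b.lower()).ratio())


def build_index(master_rows):
    """Group master rows by state so each lookup scans one block only."""
    index = defaultdict(list)
    for row in master_rows:
        index[row["state"].lower()].append(row)
    return index


def best_match(row, index):
    """Return (best_master_row, per_field_scores), or (None, {}) for an empty block."""
    best_row, best_scores, best_total = None, {}, -1
    for candidate in index.get(row["state"].lower(), []):
        scores = {f: similarity(row[f], candidate[f]) for f in FIELDS}
        total = sum(scores.values())
        if total > best_total:
            best_row, best_scores, best_total = candidate, scores, total
    return best_row, best_scores
```

Blocking on state trades recall for speed: a record with the right company and
phone but a wrong state would never be compared with its true match. A
different key (for example zip or phone) may suit the data better. The same
grouping could plausibly serve as the key in a MapReduce job, with each reducer
scoring one block.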
>
> Thanks,
> Jay'
>
>
> On Tue, Apr 23, 2013 at 4:49 AM, Kathleen Ting <ka...@gmail.com> wrote:
>
>> Hi Jay, can you share your use-case behind verifying the table in
>> Sqoop rather than in HDFS? Generally speaking, you can verify if the
>> table transferred successfully by inspecting the file's contents via
>> issuing $ hadoop fs -cat <tablename>/part-m-00000
>>
>> You can also verify the return value from the Sqoop command ($ echo
>> $?), which should be 0.
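
The exit-status check described above can also be done from a script instead of
with `echo $?`. A minimal Python sketch follows, assuming the transfer is
launched as a child process. The function name transfer_succeeded is
hypothetical, and the connection string and table name in the comment are
placeholders.

```python
import subprocess


def transfer_succeeded(command):
    """Run the command and report whether it exited with status 0."""
    completed = subprocess.run(command)
    return completed.returncode == 0


# Example call (placeholder host, database, and table):
# transfer_succeeded(["sqoop", "import",
#                     "--connect", "jdbc:mysql://dbhost/mydb",
#                     "--table", "Table1"])
```

A zero exit status only says that the command reported success. Inspecting the
part files in HDFS, as described above, remains a separate check.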
>>
>> Regards, Kate
>>
>>
>> On Monday, April 22, 2013 10:19:20 AM UTC-7, jaikumar krishna wrote:
>>>
>>> Hi,
>>>     How can I find out whether the table was moved successfully in Sqoop
>>> (not in HDFS)?
>>>
>>> Thanks,
>>> Jay'
>>>

Re: sqoop job

Posted by Kathleen Ting <ka...@apache.org>.
[Including jaikumarvin@gmail.com]

