Posted to user@hadoop.apache.org by unmesha sreeveni <un...@gmail.com> on 2015/01/15 07:06:55 UTC

How to partition a file into smaller pieces for performing KNN in Hadoop MapReduce

In a KNN-like algorithm we need to load the model data into a cache for
predicting the records.

Here is an example for KNN.


[image: Inline image 1]

So if the model is a large file, say 1 or 2 GB, we will not be able to load
it into the distributed cache.

One way is to split/partition the model into several files, perform the
distance calculation for all records in each file, and then find the minimum
distance and the most frequent class label to predict the outcome.

How can we partition the file and perform the operation on these partitions?

i.e.  1st record <Distance> partition1, partition2, ....
      2nd record <Distance> partition1, partition2, ...

This is the approach that came to mind.
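
For illustration, a minimal sketch of the mapper side of this idea
(hypothetical Java; the cache file name "model-part" and the helpers
distance() and labelOf() are assumptions, not taken from any real
implementation):

    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.List;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    // Sketch: each mapper holds ONE partition of the model, small enough for
    // the distributed cache; candidates from all partitions meet in the reducer.
    public class PartitionedKnnMapper
        extends Mapper<LongWritable, Text, Text, Text> {

      private final List<String> modelPartition = new ArrayList<>();

      @Override
      protected void setup(Context context) throws IOException {
        // Assumes the driver called job.addCacheFile(...) with one model
        // partition, symlinked into the working directory as "model-part".
        try (BufferedReader in =
            new BufferedReader(new FileReader("model-part"))) {
          String line;
          while ((line = in.readLine()) != null) {
            modelPartition.add(line);
          }
        }
      }

      @Override
      protected void map(LongWritable key, Text testRecord, Context context)
          throws IOException, InterruptedException {
        // Emit <testRecord, "distance,classLabel"> for every model record
        // in this partition.
        for (String modelRecord : modelPartition) {
          double d = distance(testRecord.toString(), modelRecord);
          context.write(testRecord, new Text(d + "," + labelOf(modelRecord)));
        }
      }

      // Placeholders: the real record parsing and distance metric are
      // application-specific and not given here.
      private double distance(String a, String b) { return 0.0; }
      private String labelOf(String modelRecord) { return ""; }
    }

A reducer would then keep the k smallest distances per test record across all
partitions and emit the most frequent class label among them.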

Is there any better way?

Any pointers would be appreciated.

-- 
*Thanks & Regards *


*Unmesha Sreeveni U.B*
*Hadoop, Bigdata Developer*
*Centre for Cyber Security | Amrita Vishwa Vidyapeetham*
http://www.unmeshasreeveni.blogspot.in/

Re: How to partition a file into smaller pieces for performing KNN in Hadoop MapReduce

Posted by unmesha sreeveni <un...@gmail.com>.
I have 4 nodes and the replication factor is set to 3.
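
(Given 4 datanodes, the suggestion below of matching the replication factor
to the number of datanodes would presumably translate to:

    hdfs dfs -setrep -w 4 /user/model/data

assuming the model is stored at /user/model/data, as in the example.)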

On Wed, Jan 21, 2015 at 11:15 AM, Drake민영근 <dr...@nexr.com> wrote:

> Yes, almost the same. I assume the most time-consuming part was copying the
> model data from the datanode that holds it to the actual processing node
> (tasktracker or nodemanager).
>
> What is the model data's replication factor, and how many nodes do you have?
> If you have 4 or more nodes, you can increase the replication with the
> following command. I suggest a number equal to your number of datanodes, but
> first you should confirm that there is enough space in HDFS.
>
>
>    - hdfs dfs -setrep -w 6 /user/model/data
>
>
>
>
> Drake 민영근 Ph.D
>
> On Wed, Jan 21, 2015 at 2:12 PM, unmesha sreeveni <un...@gmail.com>
> wrote:
>
>> Yes, I tried the same, Drake.
>>
>> I am not sure I understood your answer correctly.
>>
>> Instead of loading the model into setup() through the cache, I read it
>> directly from HDFS in the map section, and for each incoming record I find
>> the distance to all the records in HDFS.
>> i.e. if R and S are my datasets, R is the model data stored in HDFS,
>> and when S is taken for processing:
>> S1-R (finding the distance to the whole R set)
>> S2-R
>>
>> But it is taking a long time, as it needs to compute all the distances.
>>
>> On Wed, Jan 21, 2015 at 10:31 AM, Drake민영근 <dr...@nexr.com> wrote:
>>
>>> In my suggestion, the map or reduce tasks do not use the distributed cache.
>>> They read the file directly from HDFS with short-circuit local reads. It is
>>> like a shared-storage method, except that with a high replication factor
>>> almost every node holds the data locally.
>>>
>>> Drake 민영근 Ph.D
>>>
>>> On Wed, Jan 21, 2015 at 1:49 PM, unmesha sreeveni <unmeshabiju@gmail.com
>>> > wrote:
>>>
>>>> But still, if the model is very large, how can we load it into the
>>>> distributed cache or something like that?
>>>> Here is one source:
>>>> http://www.cs.utah.edu/~lifeifei/papers/knnslides.pdf
>>>> But it is confusing me.
>>>>
>>>> On Wed, Jan 21, 2015 at 7:30 AM, Drake민영근 <dr...@nexr.com> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> How about this? The large model data stays in HDFS, but with many
>>>>> replicas, and the MapReduce program reads the model from HDFS. If the
>>>>> replication factor of the model data equals the number of datanodes, then
>>>>> with the Short Circuit Local Reads feature of the HDFS datanode, the map
>>>>> or reduce tasks read the model data from their own local disks.
>>>>>
>>>>> This way may use a lot of HDFS space, but the annoying partitioning
>>>>> problem will be gone.
>>>>>
>>>>> Thanks
>>>>>
>>>>> Drake 민영근 Ph.D
>>>>>
>>>>> On Thu, Jan 15, 2015 at 6:05 PM, unmesha sreeveni <
>>>>> unmeshabiju@gmail.com> wrote:
>>>>>
>>>>>> Is there any way?
>>>>>> I am waiting for a reply. I have posted the question everywhere, but no
>>>>>> one is responding.
>>>>>> I feel this is the right place to ask, as some of you may have come
>>>>>> across the same issue and gotten stuck.
>>>>>>
>>>>>> On Thu, Jan 15, 2015 at 12:34 PM, unmesha sreeveni <
>>>>>> unmeshabiju@gmail.com> wrote:
>>>>>>
>>>>>>> Yes, one of my friends is implementing the same. I know global sharing
>>>>>>> of data is not possible across Hadoop MapReduce, but I need to check
>>>>>>> whether it can be done somehow in Hadoop MapReduce as well, because I
>>>>>>> have found some papers on KNN in Hadoop too.
>>>>>>> And I am trying to compare the performance as well.
>>>>>>>
>>>>>>> I hope some pointers can help me.
>>>>>>>
>>>>>>>
>>>>>>> On Thu, Jan 15, 2015 at 12:17 PM, Ted Dunning <ted.dunning@gmail.com
>>>>>>> > wrote:
>>>>>>>
>>>>>>>>
>>>>>>>> Have you considered implementing this using something like Spark? That
>>>>>>>> could be much easier than raw map-reduce.
>>>>>>>>
>>>>>>>> On Wed, Jan 14, 2015 at 10:06 PM, unmesha sreeveni <
>>>>>>>> unmeshabiju@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> In a KNN-like algorithm we need to load the model data into a cache
>>>>>>>>> for predicting the records.
>>>>>>>>>
>>>>>>>>> Here is an example for KNN.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> [image: Inline image 1]
>>>>>>>>>
>>>>>>>>> So if the model is a large file, say 1 or 2 GB, we will not be able
>>>>>>>>> to load it into the distributed cache.
>>>>>>>>>
>>>>>>>>> One way is to split/partition the model into several files, perform
>>>>>>>>> the distance calculation for all records in each file, and then find
>>>>>>>>> the minimum distance and the most frequent class label to predict
>>>>>>>>> the outcome.
>>>>>>>>>
>>>>>>>>> How can we partition the file and perform the operation on these
>>>>>>>>> partitions?
>>>>>>>>>
>>>>>>>>> i.e.  1st record <Distance> partition1, partition2, ....
>>>>>>>>>       2nd record <Distance> partition1, partition2, ...
>>>>>>>>>
>>>>>>>>> This is the approach that came to mind.
>>>>>>>>>
>>>>>>>>> Is there any better way?
>>>>>>>>>
>>>>>>>>> Any pointers would be appreciated.
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> *Thanks & Regards *
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> *Unmesha Sreeveni U.B*
>>>>>>>>> *Hadoop, Bigdata Developer*
>>>>>>>>> *Centre for Cyber Security | Amrita Vishwa Vidyapeetham*
>>>>>>>>> http://www.unmeshasreeveni.blogspot.in/
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>> *Thanks & Regards *
>>>>>>>
>>>>>>>
>>>>>>> *Unmesha Sreeveni U.B*
>>>>>>> *Hadoop, Bigdata Developer*
>>>>>>> *Centre for Cyber Security | Amrita Vishwa Vidyapeetham*
>>>>>>> http://www.unmeshasreeveni.blogspot.in/
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>
>>>>>>
>>>>>> --
>>>>>> *Thanks & Regards *
>>>>>>
>>>>>>
>>>>>> *Unmesha Sreeveni U.B*
>>>>>> *Hadoop, Bigdata Developer*
>>>>>> *Centre for Cyber Security | Amrita Vishwa Vidyapeetham*
>>>>>> http://www.unmeshasreeveni.blogspot.in/
>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>
>>>>
>>>> --
>>>> *Thanks & Regards *
>>>>
>>>>
>>>> *Unmesha Sreeveni U.B*
>>>> *Hadoop, Bigdata Developer*
>>>> *Centre for Cyber Security | Amrita Vishwa Vidyapeetham*
>>>> http://www.unmeshasreeveni.blogspot.in/
>>>>
>>>>
>>>>
>>>
>>
>>
>> --
>> *Thanks & Regards *
>>
>>
>> *Unmesha Sreeveni U.B*
>> *Hadoop, Bigdata Developer*
>> *Centre for Cyber Security | Amrita Vishwa Vidyapeetham*
>> http://www.unmeshasreeveni.blogspot.in/
>>
>>
>>
>


-- 
*Thanks & Regards *


*Unmesha Sreeveni U.B*
*Hadoop, Bigdata Developer*
*Centre for Cyber Security | Amrita Vishwa Vidyapeetham*
http://www.unmeshasreeveni.blogspot.in/

Re: How to partition a file into smaller pieces for performing KNN in Hadoop MapReduce

Posted by Drake민영근 <dr...@nexr.com>.
Yes, almost the same. I assume the most time-consuming part was copying the
model data from the datanode that holds it to the actual processing node
(tasktracker or nodemanager).

What is the model data's replication factor, and how many nodes do you have?
If you have 4 or more nodes, you can increase the replication with the
following command. I suggest a number equal to your number of datanodes, but
first you should confirm that there is enough space in HDFS.


   - hdfs dfs -setrep -w 6 /user/model/data
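
Note that short-circuit local reads are not enabled by default. A minimal
hdfs-site.xml sketch for a typical Hadoop 2.x setup (the socket path is only
an example; it must exist and be writable by the datanode user):

   <property>
     <name>dfs.client.read.shortcircuit</name>
     <value>true</value>
   </property>
   <property>
     <name>dfs.domain.socket.path</name>
     <value>/var/lib/hadoop-hdfs/dn_socket</value>
   </property>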




Drake 민영근 Ph.D

On Wed, Jan 21, 2015 at 2:12 PM, unmesha sreeveni <un...@gmail.com>
wrote:

> Yes, I tried the same, Drake.
>
> I am not sure I understood your answer correctly.
>
> Instead of loading the model into setup() through the cache, I read it
> directly from HDFS in the map section, and for each incoming record I find
> the distance to all the records in HDFS.
> i.e. if R and S are my datasets, R is the model data stored in HDFS,
> and when S is taken for processing:
> S1-R (finding the distance to the whole R set)
> S2-R
>
> But it is taking a long time, as it needs to compute all the distances.
>
> On Wed, Jan 21, 2015 at 10:31 AM, Drake민영근 <dr...@nexr.com> wrote:
>
>> In my suggestion, the map or reduce tasks do not use the distributed cache.
>> They read the file directly from HDFS with short-circuit local reads. It is
>> like a shared-storage method, except that with a high replication factor
>> almost every node holds the data locally.
>>
>> Drake 민영근 Ph.D
>>
>> On Wed, Jan 21, 2015 at 1:49 PM, unmesha sreeveni <un...@gmail.com>
>> wrote:
>>
>>> But still, if the model is very large, how can we load it into the
>>> distributed cache or something like that?
>>> Here is one source:
>>> http://www.cs.utah.edu/~lifeifei/papers/knnslides.pdf
>>> But it is confusing me.
>>>
>>> On Wed, Jan 21, 2015 at 7:30 AM, Drake민영근 <dr...@nexr.com> wrote:
>>>
>>>> Hi,
>>>>
>>>> How about this? The large model data stays in HDFS, but with many
>>>> replicas, and the MapReduce program reads the model from HDFS. If the
>>>> replication factor of the model data equals the number of datanodes, then
>>>> with the Short Circuit Local Reads feature of the HDFS datanode, the map
>>>> or reduce tasks read the model data from their own local disks.
>>>>
>>>> This way may use a lot of HDFS space, but the annoying partitioning
>>>> problem will be gone.
>>>>
>>>> Thanks
>>>>
>>>> Drake 민영근 Ph.D
>>>>
>>>> On Thu, Jan 15, 2015 at 6:05 PM, unmesha sreeveni <
>>>> unmeshabiju@gmail.com> wrote:
>>>>
>>>>> Is there any way?
>>>>> I am waiting for a reply. I have posted the question everywhere, but no
>>>>> one is responding.
>>>>> I feel this is the right place to ask, as some of you may have come
>>>>> across the same issue and gotten stuck.
>>>>>
>>>>> On Thu, Jan 15, 2015 at 12:34 PM, unmesha sreeveni <
>>>>> unmeshabiju@gmail.com> wrote:
>>>>>
>>>>>> Yes, one of my friends is implementing the same. I know global sharing
>>>>>> of data is not possible across Hadoop MapReduce, but I need to check
>>>>>> whether it can be done somehow in Hadoop MapReduce as well, because I
>>>>>> have found some papers on KNN in Hadoop too.
>>>>>> And I am trying to compare the performance as well.
>>>>>>
>>>>>> I hope some pointers can help me.
>>>>>>
>>>>>>
>>>>>> On Thu, Jan 15, 2015 at 12:17 PM, Ted Dunning <te...@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>>
>>>>>>> Have you considered implementing this using something like Spark? That
>>>>>>> could be much easier than raw map-reduce.
>>>>>>>
>>>>>>> On Wed, Jan 14, 2015 at 10:06 PM, unmesha sreeveni <
>>>>>>> unmeshabiju@gmail.com> wrote:
>>>>>>>
>>>>>>>> In a KNN-like algorithm we need to load the model data into a cache
>>>>>>>> for predicting the records.
>>>>>>>>
>>>>>>>> Here is an example for KNN.
>>>>>>>>
>>>>>>>>
>>>>>>>> [image: Inline image 1]
>>>>>>>>
>>>>>>>> So if the model is a large file, say 1 or 2 GB, we will not be able
>>>>>>>> to load it into the distributed cache.
>>>>>>>>
>>>>>>>> One way is to split/partition the model into several files, perform
>>>>>>>> the distance calculation for all records in each file, and then find
>>>>>>>> the minimum distance and the most frequent class label to predict
>>>>>>>> the outcome.
>>>>>>>>
>>>>>>>> How can we partition the file and perform the operation on these
>>>>>>>> partitions?
>>>>>>>>
>>>>>>>> i.e.  1st record <Distance> partition1, partition2, ....
>>>>>>>>       2nd record <Distance> partition1, partition2, ...
>>>>>>>>
>>>>>>>> This is the approach that came to mind.
>>>>>>>>
>>>>>>>> Is there any better way?
>>>>>>>>
>>>>>>>> Any pointers would be appreciated.
>>>>>>>>
>>>>>>>> --
>>>>>>>> *Thanks & Regards *
>>>>>>>>
>>>>>>>>
>>>>>>>> *Unmesha Sreeveni U.B*
>>>>>>>> *Hadoop, Bigdata Developer*
>>>>>>>> *Centre for Cyber Security | Amrita Vishwa Vidyapeetham*
>>>>>>>> http://www.unmeshasreeveni.blogspot.in/
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>>
>>>>>> --
>>>>>> *Thanks & Regards *
>>>>>>
>>>>>>
>>>>>> *Unmesha Sreeveni U.B*
>>>>>> *Hadoop, Bigdata Developer*
>>>>>> *Centre for Cyber Security | Amrita Vishwa Vidyapeetham*
>>>>>> http://www.unmeshasreeveni.blogspot.in/
>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> *Thanks & Regards *
>>>>>
>>>>>
>>>>> *Unmesha Sreeveni U.B*
>>>>> *Hadoop, Bigdata Developer*
>>>>> *Centre for Cyber Security | Amrita Vishwa Vidyapeetham*
>>>>> http://www.unmeshasreeveni.blogspot.in/
>>>>>
>>>>>
>>>>>
>>>>
>>>
>>>
>>> --
>>> *Thanks & Regards *
>>>
>>>
>>> *Unmesha Sreeveni U.B*
>>> *Hadoop, Bigdata Developer*
>>> *Centre for Cyber Security | Amrita Vishwa Vidyapeetham*
>>> http://www.unmeshasreeveni.blogspot.in/
>>>
>>>
>>>
>>
>
>
> --
> *Thanks & Regards *
>
>
> *Unmesha Sreeveni U.B*
> *Hadoop, Bigdata Developer*
> *Centre for Cyber Security | Amrita Vishwa Vidyapeetham*
> http://www.unmeshasreeveni.blogspot.in/
>
>
>

Re: How to partition a file into smaller pieces for performing KNN in Hadoop MapReduce

Posted by unmesha sreeveni <un...@gmail.com>.
Yes, I tried the same, Drake.

I am not sure I understood your answer correctly.

Instead of loading the model into setup() through the cache, I read it
directly from HDFS in the map section, and for each incoming record I find
the distance to all the records in HDFS.
i.e. if R and S are my datasets, R is the model data stored in HDFS,
and when S is taken for processing:
S1-R (finding the distance to the whole R set)
S2-R

But it is taking a long time, as it needs to compute all the distances.
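
For reference, a minimal sketch of what this direct-read mapper could look
like (hypothetical Java; the job property "knn.model.path" and the helpers
distance() and labelOf() are assumptions):

    import java.io.BufferedReader;
    import java.io.IOException;
    import java.io.InputStreamReader;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    // Sketch of the direct-read approach: every record of S triggers a full
    // scan of the model R in HDFS, which is why the job is O(|R| * |S|).
    public class DirectHdfsKnnMapper
        extends Mapper<LongWritable, Text, Text, Text> {

      private FileSystem fs;
      private Path modelPath;

      @Override
      protected void setup(Context context) throws IOException {
        Configuration conf = context.getConfiguration();
        fs = FileSystem.get(conf);
        // "knn.model.path" is an assumed job property, e.g. /user/model/data.
        modelPath = new Path(conf.get("knn.model.path"));
      }

      @Override
      protected void map(LongWritable key, Text sRecord, Context context)
          throws IOException, InterruptedException {
        double best = Double.MAX_VALUE;
        String bestLabel = null;
        // Full pass over R for each incoming S record.
        try (BufferedReader in = new BufferedReader(
            new InputStreamReader(fs.open(modelPath)))) {
          String rRecord;
          while ((rRecord = in.readLine()) != null) {
            double d = distance(sRecord.toString(), rRecord);
            if (d < best) {
              best = d;
              bestLabel = labelOf(rRecord);
            }
          }
        }
        context.write(sRecord, new Text(bestLabel));
      }

      // Placeholders for the application-specific parsing and metric.
      private double distance(String a, String b) { return 0.0; }
      private String labelOf(String rRecord) { return ""; }
    }

Re-opening the model stream once per input record adds I/O on top of the
unavoidable O(|R| * |S|) distance computations, which matches the slowness
described above.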

On Wed, Jan 21, 2015 at 10:31 AM, Drake민영근 <dr...@nexr.com> wrote:

> In my suggestion, the map or reduce tasks do not use the distributed cache.
> They read the file directly from HDFS with short-circuit local reads. It is
> like a shared-storage method, except that with a high replication factor
> almost every node holds the data locally.
>
> Drake 민영근 Ph.D
>
> On Wed, Jan 21, 2015 at 1:49 PM, unmesha sreeveni <un...@gmail.com>
> wrote:
>
>> But still, if the model is very large, how can we load it into the
>> distributed cache or something like that?
>> Here is one source:
>> http://www.cs.utah.edu/~lifeifei/papers/knnslides.pdf
>> But it is confusing me.
>>
>> On Wed, Jan 21, 2015 at 7:30 AM, Drake민영근 <dr...@nexr.com> wrote:
>>
>>> Hi,
>>>
>>> How about this? The large model data stays in HDFS, but with many
>>> replicas, and the MapReduce program reads the model from HDFS. If the
>>> replication factor of the model data equals the number of datanodes, then
>>> with the Short Circuit Local Reads feature of the HDFS datanode, the map
>>> or reduce tasks read the model data from their own local disks.
>>>
>>> This way may use a lot of HDFS space, but the annoying partitioning
>>> problem will be gone.
>>>
>>> Thanks
>>>
>>> Drake 민영근 Ph.D
>>>
>>> On Thu, Jan 15, 2015 at 6:05 PM, unmesha sreeveni <unmeshabiju@gmail.com
>>> > wrote:
>>>
>>>> Is there any way?
>>>> I am waiting for a reply. I have posted the question everywhere, but no
>>>> one is responding.
>>>> I feel this is the right place to ask, as some of you may have come
>>>> across the same issue and gotten stuck.
>>>>
>>>> On Thu, Jan 15, 2015 at 12:34 PM, unmesha sreeveni <
>>>> unmeshabiju@gmail.com> wrote:
>>>>
>>>>> Yes, one of my friends is implementing the same. I know global sharing
>>>>> of data is not possible across Hadoop MapReduce, but I need to check
>>>>> whether it can be done somehow in Hadoop MapReduce as well, because I
>>>>> have found some papers on KNN in Hadoop too.
>>>>> And I am trying to compare the performance as well.
>>>>>
>>>>> I hope some pointers can help me.
>>>>>
>>>>>
>>>>> On Thu, Jan 15, 2015 at 12:17 PM, Ted Dunning <te...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>>
>>>>>> Have you considered implementing this using something like Spark? That
>>>>>> could be much easier than raw map-reduce.
>>>>>>
>>>>>> On Wed, Jan 14, 2015 at 10:06 PM, unmesha sreeveni <
>>>>>> unmeshabiju@gmail.com> wrote:
>>>>>>
>>>>>>> In a KNN-like algorithm we need to load the model data into a cache
>>>>>>> for predicting the records.
>>>>>>>
>>>>>>> Here is an example for KNN.
>>>>>>>
>>>>>>>
>>>>>>> [image: Inline image 1]
>>>>>>>
>>>>>>> So if the model is a large file, say 1 or 2 GB, we will not be able
>>>>>>> to load it into the distributed cache.
>>>>>>>
>>>>>>> One way is to split/partition the model into several files, perform
>>>>>>> the distance calculation for all records in each file, and then find
>>>>>>> the minimum distance and the most frequent class label to predict
>>>>>>> the outcome.
>>>>>>>
>>>>>>> How can we partition the file and perform the operation on these
>>>>>>> partitions?
>>>>>>>
>>>>>>> i.e.  1st record <Distance> partition1, partition2, ....
>>>>>>>       2nd record <Distance> partition1, partition2, ...
>>>>>>>
>>>>>>> This is the approach that came to mind.
>>>>>>>
>>>>>>> Is there any better way?
>>>>>>>
>>>>>>> Any pointers would be appreciated.
>>>>>>>
>>>>>>> --
>>>>>>> *Thanks & Regards *
>>>>>>>
>>>>>>>
>>>>>>> *Unmesha Sreeveni U.B*
>>>>>>> *Hadoop, Bigdata Developer*
>>>>>>> *Centre for Cyber Security | Amrita Vishwa Vidyapeetham*
>>>>>>> http://www.unmeshasreeveni.blogspot.in/
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> *Thanks & Regards *
>>>>>
>>>>>
>>>>> *Unmesha Sreeveni U.B*
>>>>> *Hadoop, Bigdata Developer*
>>>>> *Centre for Cyber Security | Amrita Vishwa Vidyapeetham*
>>>>> http://www.unmeshasreeveni.blogspot.in/
>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>> --
>>>> *Thanks & Regards *
>>>>
>>>>
>>>> *Unmesha Sreeveni U.B*
>>>> *Hadoop, Bigdata Developer*
>>>> *Centre for Cyber Security | Amrita Vishwa Vidyapeetham*
>>>> http://www.unmeshasreeveni.blogspot.in/
>>>>
>>>>
>>>>
>>>
>>
>>
>> --
>> *Thanks & Regards *
>>
>>
>> *Unmesha Sreeveni U.B*
>> *Hadoop, Bigdata Developer*
>> *Centre for Cyber Security | Amrita Vishwa Vidyapeetham*
>> http://www.unmeshasreeveni.blogspot.in/
>>
>>
>>
>


-- 
*Thanks & Regards *


*Unmesha Sreeveni U.B*
*Hadoop, Bigdata Developer*
*Centre for Cyber Security | Amrita Vishwa Vidyapeetham*
http://www.unmeshasreeveni.blogspot.in/

Re: How to partition a file to smaller size for performing KNN in hadoop mapreduce

Posted by unmesha sreeveni <un...@gmail.com>.
Yes I tried the same Drake.

I dont know if I understood your answer.

 Instead of loading them into setup() through cache I read them directly
from HDFS in map section. and for each incoming record .I found the
distance between all the records in HDFS.
ie if R ans S are my dataset, R is the model data stored in HDFs
and when S taken for processing
S1-R(finding distance with whole R set)
S2-R

But it is taking a long time as it needs to compute the distance.



-- 
*Thanks & Regards *


*Unmesha Sreeveni U.B*
*Hadoop, Bigdata Developer*
*Centre for Cyber Security | Amrita Vishwa Vidyapeetham*
http://www.unmeshasreeveni.blogspot.in/

Re: How to partition a file to smaller size for performing KNN in hadoop mapreduce

Posted by Drake민영근 <dr...@nexr.com>.
In my suggestion, the map or reduce tasks do not use the distributed
cache. They read the file directly from HDFS with short-circuit local
reads. It works like a shared-storage method, except that almost every
node already holds the data locally because of the high replication
factor.
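
For reference, short-circuit local reads must be enabled on the cluster
before this helps; a typical hdfs-site.xml fragment looks like the
sketch below (the socket path is only a common default - adjust it for
your installation):

<property>
  <name>dfs.client.read.shortcircuit</name>
  <value>true</value>
</property>
<property>
  <name>dfs.domain.socket.path</name>
  <value>/var/lib/hadoop-hdfs/dn_socket</value>
</property>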

Drake 민영근 Ph.D

Re: How to partition a file to smaller size for performing KNN in hadoop mapreduce

Posted by unmesha sreeveni <un...@gmail.com>.
But still, if the model is very large, how can we load it into the
distributed cache or do something like that?
Here is one source: http://www.cs.utah.edu/~lifeifei/papers/knnslides.pdf
But it is confusing me.
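
For what it is worth, this is roughly how the model goes through the
distributed cache today (simplified fragments; the HDFS path is made
up), and it is exactly the part that breaks down once the file grows to
1 or 2 GB:

import java.io.BufferedReader;
import java.io.FileReader;
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

// Driver side: ship the model with the job. The "#model" fragment
// creates a symlink named "model" in each task's working directory.
Configuration conf = new Configuration();
Job job = Job.getInstance(conf, "knn");
job.addCacheFile(new URI("hdfs:///user/model/data#model"));

// Mapper side, inside setup(): read the localized local-disk copy.
BufferedReader reader = new BufferedReader(new FileReader("model"));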



-- 
*Thanks & Regards *


*Unmesha Sreeveni U.B*
*Hadoop, Bigdata Developer*
*Centre for Cyber Security | Amrita Vishwa Vidyapeetham*
http://www.unmeshasreeveni.blogspot.in/

Re: How to partition a file to smaller size for performing KNN in hadoop mapreduce

Posted by Drake민영근 <dr...@nexr.com>.
Hi,

How about this? The large model data stays in HDFS, but with many
replicas, and the MapReduce program reads the model from HDFS. Ideally
the replication factor of the model data equals the number of data
nodes, and with the Short Circuit Local Reads feature of the HDFS
datanode, the map or reduce tasks read the model data from their own
local disks.

This way may consume quite a lot of HDFS space, but the annoying
partition problem will be gone.
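
Raising the replication factor of just the model file can also be done
from code; a minimal sketch (the path and the replica count are only
examples - use the number of datanodes in your cluster):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Replicate the model file to (ideally) every datanode, so that each
// task can read it from its own local disk.
Configuration conf = new Configuration();
FileSystem fs = FileSystem.get(conf);
fs.setReplication(new Path("/user/model/data"), (short) 4);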

Thanks

Drake 민영근 Ph.D

Re: How to partition a file to smaller size for performing KNN in hadoop mapreduce

Posted by unmesha sreeveni <un...@gmail.com>.
Is there any way?
Waiting for a reply. I have posted the question everywhere, but no one is
responding.
I feel like this is the right place to ask doubts, as some of you may
have come across the same issue and got stuck.
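
In the meantime, to make the idea from my first mail concrete, here is a
minimal sketch of the reduce side of the partitioning scheme. It assumes
each map task scans one model partition and emits, for every test record,
its k nearest candidates as "distance,label" values; the choice of k and
the value format are hypothetical:

    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Reducer;

    public class KnnReducer extends Reducer<Text, Text, Text, Text> {

      private static final int K = 5;                   // hypothetical k

      @Override
      protected void reduce(Text recordId, Iterable<Text> candidates,
                            Context context)
          throws IOException, InterruptedException {
        // Each value is "distance,label" for one candidate neighbour that
        // some map task found in its model partition.
        List<String[]> all = new ArrayList<String[]>();
        for (Text value : candidates) {
          all.add(value.toString().split(","));
        }

        // Merge: sort by distance over all partitions, then vote among
        // the k nearest.
        all.sort((a, b) -> Double.compare(Double.parseDouble(a[0]),
                                          Double.parseDouble(b[0])));
        Map<String, Integer> votes = new HashMap<String, Integer>();
        String best = null;
        for (int i = 0; i < Math.min(K, all.size()); i++) {
          String label = all.get(i)[1];
          int count = votes.merge(label, 1, Integer::sum);
          if (best == null || count > votes.get(best)) {
            best = label;
          }
        }
        if (best != null) {
          context.write(recordId, new Text(best));
        }
      }
    }

Note that each partition has to emit its own k nearest, not just one,
for the merge to be exact: the true k nearest neighbours of a record are
always among the union of each partition's k nearest.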

On Thu, Jan 15, 2015 at 12:34 PM, unmesha sreeveni <un...@gmail.com>
wrote:

> Yes, one of my friends is implementing the same. I know global sharing
> of data is not possible across Hadoop MapReduce, but I need to check
> whether it can somehow be done in Hadoop MapReduce as well, because I
> found some papers on KNN in Hadoop too.
> And I am trying to compare the performance.
>
> Hope some pointers can help me.
>
>
> On Thu, Jan 15, 2015 at 12:17 PM, Ted Dunning <te...@gmail.com>
> wrote:
>
>>
>> Have you considered implementing this using something like Spark?
>> That could be much easier than raw MapReduce.
>>
>> On Wed, Jan 14, 2015 at 10:06 PM, unmesha sreeveni <unmeshabiju@gmail.com
>> > wrote:
>>
>>> In a KNN-like algorithm we need to load the model data into a cache
>>> for predicting the records.
>>>
>>> Here is an example for KNN.
>>>
>>>
>>> [image: Inline image 1]
>>>
>>> So if the model is a large file, say 1 or 2 GB, we will not be able
>>> to load it into the Distributed Cache.
>>>
>>> One way is to split/partition the model result into several files,
>>> perform the distance calculation for all records in each file, and
>>> then find the minimum distance and the most frequent class label to
>>> predict the outcome.
>>>
>>> How can we partition the file and perform the operation on these
>>> partitions?
>>>
>>> i.e.  1st record <Distance> partition1, partition2, ...
>>>       2nd record <Distance> partition1, partition2, ...
>>>
>>> This is what came to my mind.
>>>
>>> Is there any other way?
>>>
>>> Any pointers would help me.


-- 
*Thanks & Regards *


*Unmesha Sreeveni U.B*
*Hadoop, Bigdata Developer*
*Centre for Cyber Security | Amrita Vishwa Vidyapeetham*
http://www.unmeshasreeveni.blogspot.in/

Re: How to partition a file to smaller size for performing KNN in hadoop mapreduce

Posted by unmesha sreeveni <un...@gmail.com>.
Yes, one of my friends is implementing the same. I know global sharing of
data is not possible across Hadoop MapReduce, but I need to check whether
it can somehow be done in Hadoop MapReduce as well, because I found some
papers on KNN in Hadoop too.
And I am trying to compare the performance.

Hope some pointers can help me.


On Thu, Jan 15, 2015 at 12:17 PM, Ted Dunning <te...@gmail.com> wrote:

>
> Have you considered implementing this using something like Spark? That
> could be much easier than raw MapReduce.
>
> On Wed, Jan 14, 2015 at 10:06 PM, unmesha sreeveni <un...@gmail.com>
> wrote:
>
>> In a KNN-like algorithm we need to load the model data into a cache
>> for predicting the records.
>>
>> Here is an example for KNN.
>>
>>
>> [image: Inline image 1]
>>
>> So if the model is a large file, say 1 or 2 GB, we will not be able to
>> load it into the Distributed Cache.
>>
>> One way is to split/partition the model result into several files,
>> perform the distance calculation for all records in each file, and
>> then find the minimum distance and the most frequent class label to
>> predict the outcome.
>>
>> How can we partition the file and perform the operation on these
>> partitions?
>>
>> i.e.  1st record <Distance> partition1, partition2, ...
>>       2nd record <Distance> partition1, partition2, ...
>>
>> This is what came to my mind.
>>
>> Is there any other way?
>>
>> Any pointers would help me.


-- 
*Thanks & Regards *


*Unmesha Sreeveni U.B*
*Hadoop, Bigdata Developer*
*Centre for Cyber Security | Amrita Vishwa Vidyapeetham*
http://www.unmeshasreeveni.blogspot.in/

Re: How to partition a file to smaller size for performing KNN in hadoop mapreduce

Posted by Ted Dunning <te...@gmail.com>.
Have you considered implementing this using something like Spark? That
could be much easier than raw MapReduce.
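
For instance, a minimal sketch with Spark's Java API, assuming the model,
once parsed, fits in memory so it can be broadcast; the HDFS paths and the
comma-separated record format (label last in model records) are
hypothetical:

    import java.util.List;

    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.JavaSparkContext;
    import org.apache.spark.broadcast.Broadcast;

    public class SparkKnn {

      public static void main(String[] args) {
        JavaSparkContext sc = new JavaSparkContext(
            new SparkConf().setAppName("knn"));

        // Parse the model once on the driver and broadcast it; every
        // executor then holds one read-only copy in memory.
        List<double[]> model = sc.textFile("hdfs:///user/model/data")
            .map(SparkKnn::parse)
            .collect();
        Broadcast<List<double[]>> bModel = sc.broadcast(model);

        // 1-NN for brevity: the last field of a model record is the label.
        JavaRDD<String> predictions =
            sc.textFile("hdfs:///user/test/data").map(line -> {
              double[] q = parse(line);
              double bestDist = Double.MAX_VALUE;
              double bestLabel = -1.0;
              for (double[] m : bModel.value()) {
                double d = 0.0;
                for (int i = 0; i < q.length; i++) {
                  double diff = q[i] - m[i];
                  d += diff * diff;
                }
                if (d < bestDist) {
                  bestDist = d;
                  bestLabel = m[m.length - 1];
                }
              }
              return line + "," + bestLabel;
            });

        predictions.saveAsTextFile("hdfs:///user/test/predictions");
        sc.stop();
      }

      // Hypothetical: comma-separated numeric fields.
      private static double[] parse(String line) {
        String[] parts = line.split(",");
        double[] v = new double[parts.length];
        for (int i = 0; i < parts.length; i++) {
          v[i] = Double.parseDouble(parts[i]);
        }
        return v;
      }
    }

The broadcast plays the same role as the distributed cache, but Spark
ships it to each executor once, with no manual partitioning of the model.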

On Wed, Jan 14, 2015 at 10:06 PM, unmesha sreeveni <un...@gmail.com>
wrote:

> In a KNN-like algorithm we need to load the model data into a cache for
> predicting the records.
>
> Here is an example for KNN.
>
>
> [image: Inline image 1]
>
> So if the model is a large file, say 1 or 2 GB, we will not be able to
> load it into the Distributed Cache.
>
> One way is to split/partition the model result into several files,
> perform the distance calculation for all records in each file, and then
> find the minimum distance and the most frequent class label to predict
> the outcome.
>
> How can we partition the file and perform the operation on these
> partitions?
>
> i.e.  1st record <Distance> partition1, partition2, ...
>       2nd record <Distance> partition1, partition2, ...
>
> This is what came to my mind.
>
> Is there any other way?
>
> Any pointers would help me.
