Posted to common-user@hadoop.apache.org by jamal sasha <ja...@gmail.com> on 2013/04/01 23:25:26 UTC

Finding mean and median python streaming

Very dumb question..
I have data as follows:
id, value
1, 20.2
1, 20.4
....

I want to find the mean and median of the values for each id.
I am using Python Hadoop streaming..
mapper.py

import os
import sys

for line in sys.stdin:
    try:
        # strip the trailing line separator
        line = line.rstrip(os.linesep)
        tokens = line.split(",")
        print '%s,%s' % (tokens[0], tokens[1])
    except Exception:
        continue


reducer.py

import os
import sys
from collections import defaultdict

# declared globally (see the follow-up message below)
data_dict = defaultdict(list)

def mean(data_list):
    return sum(data_list) / float(len(data_list)) if len(data_list) else 0

def median(mylist):
    sorts = sorted(mylist)
    length = len(sorts)
    if not length % 2:
        return (sorts[length / 2] + sorts[length / 2 - 1]) / 2.0
    return sorts[length / 2]

for line in sys.stdin:
    try:
        line = line.rstrip(os.linesep)
        serial_id, duration = line.split(",")
        data_dict[serial_id].append(float(duration))
    except Exception:
        pass

for k, v in data_dict.items():
    print "%s,%s,%s" % (k, mean(v), median(v))



I am expecting a single mean and median for each key,
but I see the same id duplicated with different means and medians..
Any suggestions?

Re: Finding mean and median python streaming

Posted by Yanbo Liang <ya...@gmail.com>.
Because each Reducer computes only the local mean and median of the data
that is shuffled to it.
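
[Editor's sketch, not from the thread.] A small illustration of the
partitioning step being described here, plus one detail worth checking in
the mapper above: streaming treats everything up to the first tab of a map
output line as the key, and if there is no tab the whole line is the key.
A mapper that prints "id,value" with no tab therefore partitions on the
entire line, so records for the same id can land on different Reducers.
Python's hash() is only a stand-in for the real HashPartitioner, but the
"same key goes to the same reducer" property is the same:

num_reducers = 4

def partition(map_output_line):
    # streaming key = everything before the first tab (whole line if none)
    key = map_output_line.split("\t", 1)[0]
    # HashPartitioner stand-in: identical keys always hash to one reducer
    return hash(key) % num_reducers

for line in ["1, 20.2", "1,20.4", "2,33.0"]:
    print "%-8s -> reducer %d" % (line, partition(line))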

Re: Finding mean and median python streaming

Posted by jamal sasha <ja...@gmail.com>.
Hi,
  I have a question.
Why do I need to set the number of reducers to 1?
Since all the keys are sorted, shouldn't most of the keys go to the same reducer?

Re: Finding mean and median python streaming

Posted by Yanbo Liang <ya...@gmail.com>.
How many Reducers did you start for this job?
If you start many Reducers, the job will produce multiple output files
named part-*****, and each part contains only the local mean and median
values for that Reducer's partition.

Two kinds of solutions:
1. Call setNumReduceTasks(1) to set the number of Reducers to 1; the job
will then produce a single output file in which each distinct key has
exactly one mean and median value.
2. See org.apache.hadoop.examples.WordMedian in the Hadoop source code: it
post-processes the output files produced by multiple Reducers with a local
function and produces the final result.

BR
Yanbo
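
[Editor's note.] For a streaming job there is no Java job object on which
to call setNumReduceTasks(1); the equivalent is the mapred.reduce.tasks
property (or the -numReduceTasks streaming option). A sketch of the
invocation for the scripts in this thread; the jar location and the HDFS
paths are placeholders:

hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-streaming-*.jar \
    -D mapred.reduce.tasks=1 \
    -input /user/jamal/input \
    -output /user/jamal/output \
    -mapper mapper.py \
    -reducer reducer.py \
    -file mapper.py \
    -file reducer.py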

Re: Finding mean and median python streaming

Posted by jamal sasha <ja...@gmail.com>.
Pinging again.
Let me rephrase the question.
If my data is like:
id, value

and I want to find the average "value" for each id, how can I do that using
Hadoop streaming?
I am sure it should be very straightforward, but apparently my understanding
of how code works in Hadoop streaming is not right.
I would really appreciate it if someone could help me with this query.
Thanks

Re: Finding mean and median python streaming

Posted by jamal sasha <ja...@gmail.com>.
data_dict is declared globally as
data_dict = defaultdict(list)
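
[Editor's note.] For readers unfamiliar with it, defaultdict(list) from
the collections module creates a fresh empty list on first access to a
missing key, which is what lets the reducer append without checking
whether the key exists. A quick illustration, not from the thread:

from collections import defaultdict

data_dict = defaultdict(list)
data_dict["1"].append(20.2)   # no KeyError: "1" gets a fresh [] first
data_dict["1"].append(20.4)
print dict(data_dict)         # {'1': [20.2, 20.4]}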