Posted to dev@climate.apache.org by "Alex Goodman (JIRA)" <ji...@apache.org> on 2013/08/08 18:33:50 UTC

[jira] [Commented] (CLIMATE-248) PERFORMANCE - Rebinning Daily to Monthly datasets takes a really long time

    [ https://issues.apache.org/jira/browse/CLIMATE-248?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13733654#comment-13733654 ] 

Alex Goodman commented on CLIMATE-248:
--------------------------------------

Cam,

Here are a few quick improvements I can think of:

First, there are many places where the code calls:
np.unique(timeunits)

You should make this call only once, store the result in a variable, and then reuse that variable wherever it is needed again.
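
For example (a rough sketch; I'm reusing the variable names from the snippet further down, and "unique_units" is just a name I made up):

{code}
# Compute the unique time units once, outside the loop, and reuse the result
unique_units = np.unique(timeunits)

meanstore = np.zeros([len(unique_units), data.shape[1], data.shape[2]])
for i, myunit in enumerate(unique_units):
    # ... loop body unchanged, but using unique_units instead of
    # calling np.unique(timeunits) again ...
{code}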

Also, remove this line:
datamask_at_this_timeunit = np.zeros_like(data)

And change the following line to:
datamask_at_this_timeunit = process.create_mask_using_threshold(data[timeunits==myunit,:,:],threshold=0.75)

A key thing to know about numpy: constructing new arrays is very expensive, so if it is not necessary then we should not do it. Since process.create_mask_using_threshold() builds and returns a new array anyway, there is no need to allocate an array of the same shape in advance. Likewise, np.unique() allocates a new array each time it is called, so calling it only once should also save you some time.
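
To put a rough number on it, here is a quick, self-contained timing of just the allocation (illustrative only, not OCW code; the array shape roughly matches the 2-year daily global grid from the test):

{code}
import timeit
import numpy as np

data = np.zeros((730, 180, 360))  # ~2 years of daily data on a 1x1 degree grid

# Each call allocates and zero-fills a ~380 MB array
t = timeit.timeit(lambda: np.zeros_like(data), number=20)
print("20 redundant allocations took %.2f s" % t)
{code}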

I don't expect these changes to give you much improvement in performance, maybe 30% at most. I suspect the real bottleneck is that the rebinning is done in a loop rather than as a single vectorized numpy operation. While python loops are inherently slow, there are other costs at play too. One is that the output array has to be pre-allocated (line 316):

meanstore = np.zeros([len(np.unique(timeunits)),data.shape[1],data.shape[2]])

I have already explained why that allocation (and the extra np.unique() call) will slow things down. Additionally, there is this line inside the loop:
datamask_at_this_timeunit = process.create_mask_using_threshold(data[timeunits==myunit,:,:],threshold=0.75)

Specifically, the part where the boolean index is applied: data[timeunits==myunit,:,:]. This is an example of what numpy calls fancy indexing; the distinction is that such a slice is a copy of the data, whereas regular slice indexing (e.g. data[3:4]) only returns a view of the original array.
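
The difference is easy to demonstrate (an illustrative snippet, independent of the OCW code):

{code}
import numpy as np

a = np.arange(10)
v = a[3:5]        # basic slicing: a view, no data is copied
v[0] = -1         # this also changes a[3]
b = a[a > 5]      # boolean (fancy) indexing: allocates a new copy
b[0] = -1         # a is unchanged; b is an independent array
{code}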

Unfortunately, fixing this is not trivial due to the inherent nature of daily data: the months of the year do not all contain the same number of days. As a result it is not possible to simply reshape the original data and then apply a single vectorized numpy operation, as I did with monthly data. The best way around this in numpy that I can think of is to use a masked array with shape (number_of_months, 31, ...), where months with fewer than 31 days have masked values in the excess days. You can then get the monthly means very easily with something like new_array.mean(axis=1). This is far from an ideal solution, though, because 1) inserting the masked values into the right spots can be tricky, and 2) it cannot be done on the original array in place, so allocating another new array is still necessary.
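
Here is a sketch of that idea (hypothetical code; "unique_units" is the cached np.unique(timeunits) from above, and the fill loop glosses over the tricky bookkeeping):

{code}
import numpy as np
import numpy.ma as ma

# One slot per month, padded to 31 days; the excess days stay masked, so
# they are ignored when taking the mean.
padded = ma.masked_all((len(unique_units), 31, data.shape[1], data.shape[2]))
for m, myunit in enumerate(unique_units):
    month = data[timeunits == myunit, :, :]    # still a fancy-indexing copy
    padded[m, :month.shape[0], :, :] = month
monthly_means = padded.mean(axis=1)            # vectorized over the day axis
{code}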

The absolute fastest way to do this, then, would be to write your own python extension in C, C++, or Fortran, where loops are much faster, and call a wrapper to the compiled code from python. As we all know, though, this opens another large can of worms and would probably be a lot of work in and of itself. I'd say too much for just this particular issue, but in the long term it might be necessary if we want to support daily data.

Alex
> PERFORMANCE - Rebinning Daily to Monthly datasets takes a really long time
> --------------------------------------------------------------------------
>
>                 Key: CLIMATE-248
>                 URL: https://issues.apache.org/jira/browse/CLIMATE-248
>             Project: Apache Open Climate Workbench
>          Issue Type: Improvement
>          Components: regridding
>    Affects Versions: 0.1-incubating, 0.2-incubating
>         Environment: *nix
>            Reporter: Cameron Goodale
>            Assignee: Cameron Goodale
>              Labels: performance
>             Fix For: 0.3-incubating
>
>
> When I was testing the dataset_processor module I noticed that most tests would complete in less than 1 second.  Then I came across the "test_daily_to_monthly_rebin" test and it would take over 2 minutes to complete.
> The test initially used a 1x1 degree grid covering the globe and daily time step for 2 years (730 days).
> I ran some initial checks and the lag appears to be in the code where the data is rebinned, inside '_rcmes_calc_average_on_new_time_unit_K'.
> {code}
>                 mask = np.zeros_like(data)
>                 mask[timeunits!=myunit,:,:] = 1.0
>                 # Calculate missing data mask within each time unit...
>                 datamask_at_this_timeunit = np.zeros_like(data)
>                 datamask_at_this_timeunit[:]= process.create_mask_using_threshold(data[timeunits==myunit,:,:],threshold=0.75)
>                 # Store results for masking later
>                 datamask_store.append(datamask_at_this_timeunit[0])
>                 # Calculate means for each pixel in this time unit, ignoring missing data (using masked array).
>                 datam = ma.masked_array(data,np.logical_or(mask,datamask_at_this_timeunit))
>                 meanstore[i,:,:] = ma.average(datam,axis=0)
> {code}
> That block is suspect since the rest of the code only does simple string parsing and list appends.  I don't have the time to do a deep dive into this now, and it technically isn't broken, just really slow.
