You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@sdap.apache.org by GitBox <gi...@apache.org> on 2021/09/02 22:05:04 UTC

[GitHub] [incubator-sdap-nexus] skorper opened a new pull request #132: SDAP-338: Update match up implementation to support multi-variable tiles

skorper opened a new pull request #132:
URL: https://github.com/apache/incubator-sdap-nexus/pull/132


   https://issues.apache.org/jira/browse/SDAP-338
   
   Added support for multi-var tiles in the matchup algorithm. This support exists for multi-variable swath and gridded tiles. 
   
   I went back and forth a lot about the best way to do this. In the end, I decided the best way for now is to make it backwards compatible so the other algorithms can be updated as-needed. This means all algorithms should work as normal on single-var tiles, but would need to update their code to work with multi-var. 
   
   An alternative idea would be to update our tile representation drastically, which would require all algorithms to be updated for both single and multi var tiles. This might be the 'right' way to do this at some point, but for now the approach I went with is the one with the smallest impact on other SDAP algorithms.
   
   In order to support this I made the following changes:
   
   - Modified `Tile`:
     - Added a new variable `is_multi` which is True when the tile is multi-var.
     - Changed the name of `var_name` to `var_names`. In the single-var case, this will be a list of size 1.
   - In `CassandraProxy`, added a new case in `get_lat_lon_time_data_meta` for `swath_multi_variable_tile` and `grid_multi_variable_tile`.
     - In both cases, the 'num vars' dimension is moved to the front of the nd array. This is counter to the shape provided by the ingester. The ingester has the 'num vars' dimension at the end of the nd array. For example, let's say the data is size 30 x 30 x 30, and there are two variables. That means the ingester will store the tile data as 30 x 30 x 30 x 2, whereas in `get_lat_lon_time_data_meta` I'm transforming that to 2 x 30 x 30 x 30. The reason for this is because the algorithm doesn't really have to change, you can run the same algorithm on `data[0], data[1], ...` and it should just work. I'm definitely looking for feedback on this decision, because I can see the argument both ways.
     - In the single var case, the data size would be 30 x 30 x 30 (using the example from above) without the extra dimension. The is_multi flag needs to be used by algorithms to determine which shape to expect the data.
   - Updated `nexustiles._solr_docs_to_tiles` to parse var_names from `var_name` field, where that field is a JSON encoded array (according to William)
   - Updated `Tile` to utilize dataclasses. This just cleaned up the code and allowed removal of boilerplate code.
   - Updated code in various places to work with the case where Tile.is_multi == True.
   - Added docstrings as I was going where missing and where clarification was probably needed due to changes.
   - Added more unit tests to test_matchup.py. Updated existing unit tests to work with changes.
   
   The matchup algorithm was tested with:
   
   1. Single-var sat to in-situ
   2. Single-var sat to single-var sat
   3. Multi-var sat to single-var sat
   4. Multi-var sat to multi-var sat
   5. Multi-var sat to in-situ
   
   NOTE: I'm still working on adding additional unit tests for `nexustiles` but thought I'd open the PR ASAP because it's large. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@sdap.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [incubator-sdap-nexus] skorper commented on a change in pull request #132: SDAP-338: Update match up implementation to support multi-variable tiles

Posted by GitBox <gi...@apache.org>.
skorper commented on a change in pull request #132:
URL: https://github.com/apache/incubator-sdap-nexus/pull/132#discussion_r709632309



##########
File path: data-access/nexustiles/model/nexusmodel.py
##########
@@ -100,6 +136,8 @@ def nexus_point_generator(self, include_nan=False):
     def get_indices(self, include_nan=False):
         if include_nan:
             return list(np.ndindex(self.data.shape))
+        if self.is_multi:
+            return np.argwhere(self.data[0])

Review comment:
       I'm beginning to doubt this after a comment by Vardis during the SDAP meeting today -- if it's possible each variable is masked differently we would need to get the mask separately per-variable. This would have major performance implications for swath datasets so I need to give this more thought. 




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@sdap.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [incubator-sdap-nexus] ngachung merged pull request #132: SDAP-338: Update match up implementation to support multi-variable tiles

Posted by GitBox <gi...@apache.org>.
ngachung merged pull request #132:
URL: https://github.com/apache/incubator-sdap-nexus/pull/132


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@sdap.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [incubator-sdap-nexus] frankinspace commented on a change in pull request #132: SDAP-338: Update match up implementation to support multi-variable tiles

Posted by GitBox <gi...@apache.org>.
frankinspace commented on a change in pull request #132:
URL: https://github.com/apache/incubator-sdap-nexus/pull/132#discussion_r709661092



##########
File path: data-access/nexustiles/model/nexusmodel.py
##########
@@ -100,6 +136,8 @@ def nexus_point_generator(self, include_nan=False):
     def get_indices(self, include_nan=False):
         if include_nan:
             return list(np.ndindex(self.data.shape))
+        if self.is_multi:
+            return np.argwhere(self.data[0])

Review comment:
       I may not totally understand your concern but, I don't think that's exactly what he meant. For two variables in the same collection, if they came from different instruments and/or have different quality attributes to them it is possible that there is a measurement at index i,j for variable 1 but not at index i,j for variable 2.
   
   The "mask" is already part of the data (in the form of fill values). When read by numpy they get turned into masked arrays where the mask is part of the data array itself. So you don't need to look up a mask based on the variable being read, it's just part of the data




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@sdap.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [incubator-sdap-nexus] ngachung merged pull request #132: SDAP-338: Update match up implementation to support multi-variable tiles

Posted by GitBox <gi...@apache.org>.
ngachung merged pull request #132:
URL: https://github.com/apache/incubator-sdap-nexus/pull/132


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@sdap.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [incubator-sdap-nexus] skorper commented on pull request #132: SDAP-338: Update match up implementation to support multi-variable tiles

Posted by GitBox <gi...@apache.org>.
skorper commented on pull request #132:
URL: https://github.com/apache/incubator-sdap-nexus/pull/132#issuecomment-929755901


   @ngachung In the latest commit, I've updated matchup to work with the new multi-var solr doc structure. I also updated the unit tests and tested locally:
   
   ```
   {{big_data_url}}/match_spark?primary=ascat-l2-coastal-multi-t6&startTime=2017-01-02T18:59:37Z&endTime=2017-01-02T19:00:31Z&tt=86400&rt=100000&b=-126,44,-119,49&platforms=1,2,3,4,5,6,7,8,9&depthMin=0&depthMax=5&matchOnce=true&matchup=icoads
   ```
   
   ```json
   {
       "executionId": "19188eca-6f96-4fab-9a3d-362787058001",
       "data": [
           {
               "sea_water_temperature": null,
               "sea_water_temperature_depth": 0,
               "sea_water_salinity": null,
               "sea_water_salinity_depth": null,
               "wind_speed": null,
               "wind_direction": null,
               "wind_u": null,
               "wind_v": null,
               "platform": "orbiting satellite",
               "device": "radiometers",
               "x": "-125.24456999999998",
               "y": "47.279900000000005",
               "point": "Point(-125.24456999999998 47.279900000000005)",
               "time": 1483383611,
               "fileurl": "ascat_20170102_182400_metopb_22278_eps_o_coa_2401_ovw.l2.nc",
               "id": "dca2f9c2-eafa-3e20-a8df-4e6a815bedbb[[567 567 567]]",
               "source": "ascat-l2-coastal-multi-t6",
               "data": [
                   {
                       "variable_name": "wind_speed",
                       "cf_variable_name": "wind_speed",
                       "variable_value": 9.130000114440918
                   },
                   {
                       "variable_name": "wind_dir",
                       "cf_variable_name": "wind_to_direction",
                       "variable_value": 252.1999969482422
                   }
               ],
               "matches": [
                   {
                       "sea_water_temperature": 8.7,
                       "sea_water_temperature_depth": null,
                       "sea_water_salinity": null,
                       "sea_water_salinity_depth": null,
                       "wind_speed": 8.0,
                       "wind_direction": null,
                       "wind_u": -6.1,
                       "wind_v": -5.1,
                       "platform": "moored surface buoy",
                       "device": null,
                       "x": "-124.7",
                       "y": "47.3",
                       "point": "Point(-124.7 47.3)",
                       "time": 1483361388,
                       "fileurl": null,
                       "id": "N2H7LY",
                       "source": "icoads",
                       "data": [
                           {
                               "variable_name": "eastward_wind",
                               "cf_variable_name": null,
                               "variable_value": -6.1
                           },
                           {
                               "variable_name": "northward_wind",
                               "cf_variable_name": null,
                               "variable_value": -5.1
                           },
                           {
                               "variable_name": "wind_speed",
                               "cf_variable_name": null,
                               "variable_value": 8.0
                           },
                           {
                               "variable_name": "sea_water_temperature",
                               "cf_variable_name": null,
                               "variable_value": 8.7
                           }
                       ]
                   }
               ]
           },
   ...
   ```
   
   (matchup results truncated)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@sdap.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [incubator-sdap-nexus] skorper commented on pull request #132: SDAP-338: Update match up implementation to support multi-variable tiles

Posted by GitBox <gi...@apache.org>.
skorper commented on pull request #132:
URL: https://github.com/apache/incubator-sdap-nexus/pull/132#issuecomment-930303660


   I've added backwards compatibility with the previous iteration of the solr doc. Example solr doc:
   
   ```json
   {
           "table_s":"sea_surface_temp",
           "geo":"POLYGON((25.632 -42.376, 39.239 -42.376, 39.239 -36.698, 25.632 -36.698, 25.632 -42.376))",
           "id":"3f5aae6a-f6e1-31e0-b24e-0c648beabb70",
           "solr_id_s":"ascat-l2-coastal-multi-t5!3f5aae6a-f6e1-31e0-b24e-0c648beabb70",
           "sectionSpec_s":"NUMCELLS:30:60,NUMROWS:2850:2880",
           "dataset_s":"ascat-l2-coastal-multi-t5",
           "granule_s":"ascat_20170102_182400_metopb_22278_eps_o_coa_2401_ovw.l2.nc",
           "tile_var_name_s":"[\"wind_speed\", \"wind_dir\"]",
           "tile_standard_name_s":"[\"wind_speed\", \"wind_to_direction\"]",
           "tile_min_lon":[25.63173],
           "tile_max_lon":[39.238640000000004],
           "tile_min_lat":[-42.375870000000006],
           "tile_max_lat":[-36.69843],
           "tile_depth":[0.0],
           "tile_min_time_dt":"2017-01-02T19:53:03Z",
           "tile_max_time_dt":"2017-01-02T19:53:58Z",
           "tile_min_val_d":7.349999904632568,
           "tile_max_val_d":196.40000915527344,
           "tile_avg_val_d":71.52783203125,
           "tile_count_i":1800,
           "_version_":1712176647857242112
   }
   ```
   
   Tested this dataset with the matchup endpoint and it works as expected. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@sdap.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [incubator-sdap-nexus] skorper commented on pull request #132: SDAP-338: Update match up implementation to support multi-variable tiles

Posted by GitBox <gi...@apache.org>.
skorper commented on pull request #132:
URL: https://github.com/apache/incubator-sdap-nexus/pull/132#issuecomment-912109521


   @fgreg I think that validates my decision to put the num_vars dimension at the beginning. The ingester would be times x lats x lons x num_vars whereas I'm transforming to num_vars x times x lats x lons


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@sdap.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [incubator-sdap-nexus] skorper commented on pull request #132: SDAP-338: Update match up implementation to support multi-variable tiles

Posted by GitBox <gi...@apache.org>.
skorper commented on pull request #132:
URL: https://github.com/apache/incubator-sdap-nexus/pull/132#issuecomment-930323278


   Further updated the backwards compatibility of the solr doc -- works when the `tile_standard_name_s` field is not present. Tested this on dataset `ascat-l2-coastal-multi-t1`. Sample solr doc:
   
   ```json
   {
           "table_s":"sea_surface_temp",
           "geo":"POLYGON((29.305 40.166, 43.460 40.166, 43.460 45.753, 29.305 45.753, 29.305 40.166))",
           "id":"7b2b099e-5a21-3b92-8d89-ad2235db4389",
           "solr_id_s":"ascat-l2-coastal-multi-t1!7b2b099e-5a21-3b92-8d89-ad2235db4389",
           "sectionSpec_s":"NUMCELLS:30:60,NUMROWS:360:390",
           "dataset_s":"ascat-l2-coastal-multi-t1",
           "granule_s":"ascat_20170102_182400_metopb_22278_eps_o_coa_2401_ovw.l2.nc",
           "tile_var_name_s":"[\"wind_speed\", \"wind_to_direction\"]",
           "tile_min_lon":[29.30514],
           "tile_max_lon":[43.45994],
           "tile_min_lat":[40.16612000000001],
           "tile_max_lat":[45.752610000000004],
           "tile_depth":[0.0],
           "tile_min_time_dt":"2017-01-02T18:35:15Z",
           "tile_max_time_dt":"2017-01-02T18:36:09Z",
           "tile_min_val_d":1.3299999237060547,
           "tile_max_val_d":252.8000030517578,
           "tile_avg_val_d":33.11838150024414,
           "tile_count_i":526,
           "_version_":1708470605235355648
   }
   ```
   
   Confirmed matchup works as expected with this solr doc. Tested the following cases with this latest change:
   
   1. Solr doc with `tile_var_name_s` but no `tile_standard_name_s`
   2. Solr doc with `tile_var_name_s` and `tile_standard_name_s`
   3. Solr doc with `tile_var_name_ss`
   
   All work as expected.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@sdap.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [incubator-sdap-nexus] skorper commented on a change in pull request #132: SDAP-338: Update match up implementation to support multi-variable tiles

Posted by GitBox <gi...@apache.org>.
skorper commented on a change in pull request #132:
URL: https://github.com/apache/incubator-sdap-nexus/pull/132#discussion_r709673761



##########
File path: data-access/nexustiles/model/nexusmodel.py
##########
@@ -100,6 +136,8 @@ def nexus_point_generator(self, include_nan=False):
     def get_indices(self, include_nan=False):
         if include_nan:
             return list(np.ndindex(self.data.shape))
+        if self.is_multi:
+            return np.argwhere(self.data[0])

Review comment:
       Thanks @frankinspace, I will make that change.
   




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@sdap.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [incubator-sdap-nexus] fgreg commented on pull request #132: SDAP-338: Update match up implementation to support multi-variable tiles

Posted by GitBox <gi...@apache.org>.
fgreg commented on pull request #132:
URL: https://github.com/apache/incubator-sdap-nexus/pull/132#issuecomment-912106246


   quick comment re: moving num vars dimension. I remember reading somewhere that it is generally recommended to have the smallest dimension be the first dimension. I don't recall exactly why but I thought it had something to do with how the arrays are represented in memory/disk it is faster to traverse a 2x30x30 than a 30x30x2. I remember that was part of the consideration with putting time as the first dimension at least because that was usually the smallest.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@sdap.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [incubator-sdap-nexus] skorper commented on a change in pull request #132: SDAP-338: Update match up implementation to support multi-variable tiles

Posted by GitBox <gi...@apache.org>.
skorper commented on a change in pull request #132:
URL: https://github.com/apache/incubator-sdap-nexus/pull/132#discussion_r709666976



##########
File path: data-access/nexustiles/model/nexusmodel.py
##########
@@ -100,6 +136,8 @@ def nexus_point_generator(self, include_nan=False):
     def get_indices(self, include_nan=False):
         if include_nan:
             return list(np.ndindex(self.data.shape))
+        if self.is_multi:
+            return np.argwhere(self.data[0])

Review comment:
       The data is formatted such that data[0] == variable 1, data[1] == variable 2, etc. I am only getting the indices for data[0], the first variable because I assumed the mask would be the same across all variables. If that isn't the case I would need to retrieve the ~mask~indices for the entire data array, which I was trying to avoid because it's *huge* for swath multi-var data.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@sdap.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [incubator-sdap-nexus] skorper commented on pull request #132: SDAP-338: Update match up implementation to support multi-variable tiles

Posted by GitBox <gi...@apache.org>.
skorper commented on pull request #132:
URL: https://github.com/apache/incubator-sdap-nexus/pull/132#issuecomment-912112148


   Swath multi-var is very slow though. In the 30x30 case, it results in 2 x 900 x 900 x 900 data which is is almost entirely masked. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@sdap.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [incubator-sdap-nexus] skorper commented on pull request #132: SDAP-338: Update match up implementation to support multi-variable tiles

Posted by GitBox <gi...@apache.org>.
skorper commented on pull request #132:
URL: https://github.com/apache/incubator-sdap-nexus/pull/132#issuecomment-915451724


   Updated response fields to 'variable_name' as-per the feedback received on Friday. Also added an extra test for multi-var sat to multi-var sat. 
   
   Once [this](https://github.com/apache/incubator-sdap-ingester/pull/39) is merged and deployed to bigdata, the standard_name issue should be fixed for ascat l2 and I will add the new `cf_variable_name` field.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@sdap.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [incubator-sdap-nexus] skorper commented on pull request #132: SDAP-338: Update match up implementation to support multi-variable tiles

Posted by GitBox <gi...@apache.org>.
skorper commented on pull request #132:
URL: https://github.com/apache/incubator-sdap-nexus/pull/132#issuecomment-929755901






-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@sdap.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [incubator-sdap-nexus] skorper commented on a change in pull request #132: SDAP-338: Update match up implementation to support multi-variable tiles

Posted by GitBox <gi...@apache.org>.
skorper commented on a change in pull request #132:
URL: https://github.com/apache/incubator-sdap-nexus/pull/132#discussion_r709666976



##########
File path: data-access/nexustiles/model/nexusmodel.py
##########
@@ -100,6 +136,8 @@ def nexus_point_generator(self, include_nan=False):
     def get_indices(self, include_nan=False):
         if include_nan:
             return list(np.ndindex(self.data.shape))
+        if self.is_multi:
+            return np.argwhere(self.data[0])

Review comment:
       The data is formatted such that data[0] == variable 1, data[1] == variable 2, etc. I am only getting the indices for data[0], the first variable because I assumed the mask would be the same across all variables. If that isn't the case I would need to retrieve the mask for the entire data array, which I was trying to avoid because it's *huge* for swath multi-var data.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@sdap.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [incubator-sdap-nexus] frankinspace commented on a change in pull request #132: SDAP-338: Update match up implementation to support multi-variable tiles

Posted by GitBox <gi...@apache.org>.
frankinspace commented on a change in pull request #132:
URL: https://github.com/apache/incubator-sdap-nexus/pull/132#discussion_r709669114



##########
File path: data-access/nexustiles/model/nexusmodel.py
##########
@@ -100,6 +136,8 @@ def nexus_point_generator(self, include_nan=False):
     def get_indices(self, include_nan=False):
         if include_nan:
             return list(np.ndindex(self.data.shape))
+        if self.is_multi:
+            return np.argwhere(self.data[0])

Review comment:
       I see, so you're trying to return the indices where valid data exist? That would be dependent on the data variable.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@sdap.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org