You are viewing a plain text version of this content. The canonical link for it is here.

Posted to dev@sdap.apache.org by GitBox <gi...@apache.org> on 2021/10/18 04:23:32 UTC

[GitHub] [incubator-sdap-ingester] jasonmlkang opened a new pull request #48: Read time from filename

jasonmlkang opened a new pull request #48:
URL: https://github.com/apache/incubator-sdap-ingester/pull/48


   Some datasets lack a time variable and attributes and only indicate the date and/or time in the filename.  In these cases, the ingester needs to be able to extract the time stamp from the filenames according to a new regular expression setting in the collections-config ConfigMap.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@sdap.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [incubator-sdap-ingester] tloubrieu-jpl commented on a change in pull request #48: Read time from filename

Posted by GitBox <gi...@apache.org>.

tloubrieu-jpl commented on a change in pull request #48:
URL: https://github.com/apache/incubator-sdap-ingester/pull/48#discussion_r745816922



##########
File path: granule_ingester/granule_ingester/processors/TileSummarizingProcessor.py
##########
@@ -27,17 +30,45 @@ class NoTimeException(Exception):
     pass
 
 
-def find_time_min_max(tile_data):
+def find_time_min_max(tile_data, time_from_granule=None):
     if tile_data.time:
         if isinstance(tile_data.time, nexusproto.ShapedArray):
             time_data = from_shaped_array(tile_data.time)
             return int(numpy.nanmin(time_data).item()), int(numpy.nanmax(time_data).item())
-        elif isinstance(tile_data.time, int):
+        elif isinstance(tile_data.time, int) and \
+             tile_data.time > datetime.datetime(1970, 1, 1).timestamp():  # time should be at least greater than Epoch, right?
             return tile_data.time, tile_data.time
 
+    if time_from_granule:
+        return time_from_granule, time_from_granule
+
     raise NoTimeException
 
 
+def get_time_from_granule(granule: str) -> int:
+    """
+    Get time from granule name. It makes the assumption that a datetime is
+    specificed in the granule name, and it has the following format "YYYYddd",
+    where YYYY is 4 digit year and ddd is day of year (i.e. 1 to 365).
+    
+    Note: This is a very narrow implmentation for a specific need. If you found
+    yourself needing to modify this function to accommodate more use cases, then
+    perhaps it is time to refactor this function to a more dynamic module.
+    """
+
+    # rs search for a sub str which starts with 19 or 20, and has 7 digits
+    search_res = re.search('((?:19|20)[0-9]{2})([0-9]{3})', os.path.basename(granule))
+    if not search_res:
+        return None
+    
+    year, days = search_res.groups()
+    year = int(year)
+    days = int(days)
+
+    # since datetime is set to 1/1 (+1 day), timedelta needs to -1 to cancel it out
+    return int((datetime.datetime(year, 1, 1) + datetime.timedelta(days-1)).timestamp())

Review comment:
       Examples of file names from AQACF project:
   `V5GL01.HybridPM25.NorthAmerica.200807-200807.nc`
   `V4NA03_PM25_NA_200401_200412-RH35.nc`
   




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@sdap.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [incubator-sdap-ingester] jasonmlkang commented on pull request #48: Read time from filename

Posted by GitBox <gi...@apache.org>.

jasonmlkang commented on pull request #48:
URL: https://github.com/apache/incubator-sdap-ingester/pull/48#issuecomment-972140635


   @tloubrieu-jpl Updated the code to use strptime instead of regex


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@sdap.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [incubator-sdap-ingester] tloubrieu-jpl commented on a change in pull request #48: Read time from filename

Posted by GitBox <gi...@apache.org>.

tloubrieu-jpl commented on a change in pull request #48:
URL: https://github.com/apache/incubator-sdap-ingester/pull/48#discussion_r745807760



##########
File path: granule_ingester/granule_ingester/processors/TileSummarizingProcessor.py
##########
@@ -27,17 +30,45 @@ class NoTimeException(Exception):
     pass
 
 
-def find_time_min_max(tile_data):
+def find_time_min_max(tile_data, time_from_granule=None):
     if tile_data.time:
         if isinstance(tile_data.time, nexusproto.ShapedArray):
             time_data = from_shaped_array(tile_data.time)
             return int(numpy.nanmin(time_data).item()), int(numpy.nanmax(time_data).item())
-        elif isinstance(tile_data.time, int):
+        elif isinstance(tile_data.time, int) and \
+             tile_data.time > datetime.datetime(1970, 1, 1).timestamp():  # time should be at least greater than Epoch, right?
             return tile_data.time, tile_data.time
 
+    if time_from_granule:
+        return time_from_granule, time_from_granule
+
     raise NoTimeException
 
 
+def get_time_from_granule(granule: str) -> int:
+    """
+    Get time from granule name. It makes the assumption that a datetime is
+    specificed in the granule name, and it has the following format "YYYYddd",
+    where YYYY is 4 digit year and ddd is day of year (i.e. 1 to 365).
+    
+    Note: This is a very narrow implmentation for a specific need. If you found
+    yourself needing to modify this function to accommodate more use cases, then
+    perhaps it is time to refactor this function to a more dynamic module.
+    """
+
+    # rs search for a sub str which starts with 19 or 20, and has 7 digits
+    search_res = re.search('((?:19|20)[0-9]{2})([0-9]{3})', os.path.basename(granule))
+    if not search_res:
+        return None
+    
+    year, days = search_res.groups()
+    year = int(year)
+    days = int(days)
+
+    # since datetime is set to 1/1 (+1 day), timedelta needs to -1 to cancel it out
+    return int((datetime.datetime(year, 1, 1) + datetime.timedelta(days-1)).timestamp())

Review comment:
       @jasonmlkang did you have a chance to look at my comments. We're having similar need for AQACF project but of course the file naming convention is different so we would need to discuss how that could be configured.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@sdap.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [incubator-sdap-ingester] jasonmlkang commented on pull request #48: Read time from filename

Posted by GitBox <gi...@apache.org>.

jasonmlkang commented on pull request #48:
URL: https://github.com/apache/incubator-sdap-ingester/pull/48#issuecomment-966906135


   @tloubrieu-jpl I had implemented a more generic way to parse date time from file name. The commit is 92eb7dc. Please take a look [ParseDatetimeUtilsTests.py](https://github.com/apache/incubator-sdap-ingester/pull/48/files#diff-b57346559a274b7c91d1e89713994f49feef56d7eb910ad61f1e7f2c3c4a3f1e) for examples of how to use the parser. Notice that this is only the first take and it is not the complete work. There are additional works to fine tune the time zone and integrate this into sdap ingester.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@sdap.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [incubator-sdap-ingester] tloubrieu-jpl commented on pull request #48: Read time from filename

Posted by GitBox <gi...@apache.org>.

tloubrieu-jpl commented on pull request #48:
URL: https://github.com/apache/incubator-sdap-ingester/pull/48#issuecomment-1018723859


   Hi @jasonmlkang , are you still working on this pull request ?
   I think you were trying to deploy that on your laptop to test from completelly the new feature.
   
   From reading the code, I believe something is missing around https://github.com/apache/incubator-sdap-ingester/blob/7ee17fdf16201c499f7bd35cf398844f2c70f046/granule_ingester/granule_ingester/processors/reading_processors/GridReadingProcessor.py#L43 so that the time read in the file is used a dimension in the dataset. 
   
   I am coming back to this pull request now because we would need it for AQACF project for the GMU dataset.
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@sdap.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [incubator-sdap-ingester] tloubrieu-jpl commented on a change in pull request #48: Read time from filename

Posted by GitBox <gi...@apache.org>.

tloubrieu-jpl commented on a change in pull request #48:
URL: https://github.com/apache/incubator-sdap-ingester/pull/48#discussion_r739423815



##########
File path: granule_ingester/granule_ingester/processors/TileSummarizingProcessor.py
##########
@@ -27,17 +30,45 @@ class NoTimeException(Exception):
     pass
 
 
-def find_time_min_max(tile_data):
+def find_time_min_max(tile_data, time_from_granule=None):
     if tile_data.time:
         if isinstance(tile_data.time, nexusproto.ShapedArray):
             time_data = from_shaped_array(tile_data.time)
             return int(numpy.nanmin(time_data).item()), int(numpy.nanmax(time_data).item())
-        elif isinstance(tile_data.time, int):
+        elif isinstance(tile_data.time, int) and \
+             tile_data.time > datetime.datetime(1970, 1, 1).timestamp():  # time should be at least greater than Epoch, right?
             return tile_data.time, tile_data.time
 
+    if time_from_granule:
+        return time_from_granule, time_from_granule
+
     raise NoTimeException
 
 
+def get_time_from_granule(granule: str) -> int:
+    """
+    Get time from granule name. It makes the assumption that a datetime is
+    specificed in the granule name, and it has the following format "YYYYddd",
+    where YYYY is 4 digit year and ddd is day of year (i.e. 1 to 365).
+    
+    Note: This is a very narrow implmentation for a specific need. If you found
+    yourself needing to modify this function to accommodate more use cases, then
+    perhaps it is time to refactor this function to a more dynamic module.
+    """
+
+    # rs search for a sub str which starts with 19 or 20, and has 7 digits
+    search_res = re.search('((?:19|20)[0-9]{2})([0-9]{3})', os.path.basename(granule))
+    if not search_res:
+        return None
+    
+    year, days = search_res.groups()
+    year = int(year)
+    days = int(days)
+
+    # since datetime is set to 1/1 (+1 day), timedelta needs to -1 to cancel it out
+    return int((datetime.datetime(year, 1, 1) + datetime.timedelta(days-1)).timestamp())

Review comment:
       I think we want to be careful about the UTC timezone.
   
   For example with code
   
       import pytz
       aware = datetime.datetime(2011,8,15,8,15,12,0,pytz.UTC)
   
   
   Or with the example in my comment above:
   
       import pytz
       datetime.strptime(filename, 'A%Y%j.L3m_CHL_chlor_a_4km.nc').replace(tzinfo=pytz.utc)
   
   That would be more simple than the current version of the code and spare some lines of code.
   
   I am not sure if the hardcoded -1 days is also something we would like to have for all the datasets. Maybe an additional optional configuration variable for the collection should give an offset for the min and offset for the max time. 
   In this case that would be -1 and -1 for both.
   
   I guess the  most difficult change in my request is to read from the collection configmap. 




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@sdap.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org