Posted to dev@iceberg.apache.org by Peter Vary <pv...@cloudera.com.INVALID> on 2020/08/26 13:24:38 UTC

Question about logging

Hi Team,

I was wondering if there is a general best practice for using log levels in Iceberg - what is the usual way we do it?

I have been playing around with the Iceberg/Hive integration and was wondering how I would be able to debug a customer case based on the logs.
To be honest, with the current code the answer is most probably that I would first have to add extra log lines myself, which I think is not a good situation.

I was thinking about adding some more logging to the code, with the proposed log level next to each item; a rough sketch of the calls follows the list:
- CatalogLoader/Catalogs:
  - Which catalog is used to fetch the table - INFO
  - Which table is fetched in the end - INFO
- HiveCatalog:
  - Table is created/dropped/renamed - INFO
  - Catalog is created/dropped - INFO
- HiveTableOperations:
  - Write committed / table created - INFO
- IcebergInputFormat:
  - Scan parameters provided - DEBUG
    - SnapshotId
    - Time
    - Filters
  - Final snapshotId being read - INFO (if I find a way to get this)
  - Opening specific files and their format - DEBUG
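
The class and message texts below are made up for this mail, not the actual code; Iceberg uses SLF4J, so the pattern would be roughly:

import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class IcebergInputFormatLoggingSketch {
  private static final Logger LOG =
      LoggerFactory.getLogger(IcebergInputFormatLoggingSketch.class);

  void logScan(long snapshotId, String filters, String path, String format) {
    // Scan parameters provided by the caller - DEBUG
    LOG.debug("Scan requested: snapshotId={}, filters={}", snapshotId, filters);
    // Final snapshot being read - INFO
    LOG.info("Reading table state as of snapshotId={}", snapshotId);
    // Individual file opens and their format - DEBUG
    LOG.debug("Opening {} file: {}", format, path);
  }
}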

Are the log levels above compliant with current practice? Should I move any of them up or down? Are there any places around the Iceberg/Hive integration where you think it would be good to add more logging?

Thanks,
Peter

Re: Question about logging

Posted by Peter Vary <pv...@cloudera.com.INVALID>.
Hi Ryan,

Thanks for the detailed answer, and the confirmation! I have created the pull request:

https://github.com/apache/iceberg/pull/1394
Hive: Add logging for Hive related classes #1394
One comment on the message below: one place where the DEBUG level can be useful is when an issue is only reproducible on the customer side - they can turn on DEBUG logging and share the logs, since connecting with a debugger is not always desirable.
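
For example, if the deployment reads a Log4j 1.x properties file (an assumption - the exact knob depends on which logging backend the engine ships), enabling DEBUG only for Iceberg could look like:

# log4j.properties fragment: everything stays at INFO, Iceberg goes to DEBUG
log4j.rootLogger=INFO, console
log4j.appender.console=org.apache.log4j.ConsoleAppender
log4j.appender.console.layout=org.apache.log4j.PatternLayout
log4j.appender.console.layout.ConversionPattern=%d{ISO8601} %p %t %c: %m%n
log4j.logger.org.apache.iceberg=DEBUG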

I have seen multiline log messages in Iceberg - these can make the logs hard to read when there are many concurrent users and lines from different operations get interleaved. I suggest collapsing them to a single line.

Thanks,
Peter


Re: Question about logging

Posted by Ryan Blue <rb...@netflix.com.INVALID>.
Hi Peter,

Thanks for thinking about this! Improving logs is a great contribution.

My philosophy is to stick to the usual definitions of logging levels.
Here’s my quick summary:

   - FATAL: the event that caused the service to stop (not used in Iceberg, since it’s a library)
   - ERROR: an event that stops an operation
   - WARNING: an unexpected event where the operation can continue but there may be a bigger problem
   - INFO: events that help you understand what’s happening in normal operation
   - DEBUG: events that help you spot bugs by looking for unexpected values
In general, I think that we can do better with INFO logging. But there is a
balance to target because we don’t want Iceberg to be so verbose that
people turn the logging off. We have to remember that Iceberg is typically
integrated into some other framework that is more important. Iceberg should
log the messages that help someone understand what the engine is doing,
rather than logging internal concerns. A good example of what not to do
comes from Parquet, where INFO messages were printed with stats from each
row group — this was too much because Parquet’s contribution to the overall
picture of what the engine was doing was just that it was storing data at
the file level. Parquet internal concerns should not have leaked.

I also don’t typically like to use DEBUG logging because you almost never
keep those logs. For debugging, I prefer to use a debugger and keep the
code clean. I think you could argue that DEBUG is appropriate for cases
like my Parquet example above, where INFO isn’t a good idea for the final
application, but you may still want to log information about the operations
in a library. I think that’s fine, just not things like logging the value
of arguments to functions all over the codebase. I think the use of DEBUG
for files that are being opened sounds reasonable.

I think the additional logs you’re proposing mostly look good. The scan
configuration is logged in Iceberg’s BaseTableScan, so you should already
get snapshot ID and filters. I’m not sure that we keep the time. We also
have a system for emitting events with more detail. Those can be used for
additional logging if you choose. We send them through a Kafka pipeline so
we can analyze all the scans taking place in our data warehouse.
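
For reference, registering a listener for those events could look roughly like this (a sketch against the Listeners and ScanEvent classes in org.apache.iceberg.events; please double-check the current signatures):

import org.apache.iceberg.events.Listeners;
import org.apache.iceberg.events.ScanEvent;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

class ScanEventLogger {
  private static final Logger LOG = LoggerFactory.getLogger(ScanEventLogger.class);

  static void install() {
    // Called for every scan planned in this JVM; instead of logging, the
    // event could also be forwarded to Kafka or another pipeline.
    Listeners.register(
        event -> LOG.info("Scan planned: table={}, snapshotId={}, filter={}",
            event.tableName(), event.snapshotId(), event.filter()),
        ScanEvent.class);
  }
}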

rb

-- 
Ryan Blue
Software Engineer
Netflix