Dynamic Logging and Profiling for Cloud Services

  • Writer: Ravi Kant Gupta
  • Dec 1, 2019
  • 4 min read

Disclaimer

This document doesn’t contain any confidential or proprietary information related to my current or previous employers. It describes one of my ideas for quick diagnostics of issues in cloud services.


Problem Description and Solution

Logging is not a new concept. Logging is the process of writing log messages produced during the execution of a program to a log file or a centralized repository. It allows you to report and persist trace, error, warning and info messages (e.g., runtime statistics) so that they can later be retrieved and analysed when functionality fails. Logging can be enabled at different log levels; most logging frameworks support FATAL, WARNING, INFO, DEBUG and TRACE. With the FATAL level, only fatal errors are written to the log file, which may not be sufficient for failure analysis, while with the TRACE level all messages are written, which normally is sufficient. Logging has a side effect: it slows the system down, and drastically so when the TRACE level is enabled and the code is instrumented with many log statements. Logs help a developer triage an issue, but they are not important from the customer’s point of view; customers always expect high performance from the software. That is why production systems usually run at the FATAL log level.
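As a minimal sketch of how a log-level threshold filters messages, the example below uses the JDK's built-in java.util.logging package, whose SEVERE and FINEST levels correspond roughly to the FATAL and TRACE levels mentioned above (the class and method names are illustrative):

```java
import java.util.logging.Level;
import java.util.logging.Logger;

public class LogLevelDemo {
    static final Logger LOG = Logger.getLogger(LogLevelDemo.class.getName());

    // Returns true if a message at the given level would actually be written,
    // illustrating how the configured threshold filters output.
    public static boolean wouldLog(Level threshold, Level message) {
        LOG.setLevel(threshold);
        return LOG.isLoggable(message);
    }

    public static void main(String[] args) {
        // Production-style threshold: SEVERE (≈ FATAL) suppresses trace output.
        System.out.println(wouldLog(Level.SEVERE, Level.FINEST)); // false
        // Diagnostic threshold: FINEST (≈ TRACE) lets everything through.
        System.out.println(wouldLog(Level.FINEST, Level.INFO));   // true
    }
}
```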

Now, consider a cloud system (IaaS, PaaS or SaaS) that caters to hundreds of tenants, with every tenant having hundreds of users accessing the service. If in such an environment an end user reports a functionality failure, the operations engineer finds an exception trace in the logs and asks a developer to look into it, and the developer in turn asks for detailed logs. But if logging is enabled at the TRACE level, it may degrade performance drastically for all tenants and all users. That would be a disaster from an SLA perspective and may result in a huge penalty for the service provider.

Now, consider a slow-performing use case while the system is running at the FATAL log level. No exception is generated, so nothing is written to the log files. Developers are sometimes completely clueless about slow performance in the production environment, so they either ship multiple diagnostic patches with many extra log statements instrumented for a specific use case, or they ask to run the system under a profiling tool. A profiling tool is a form of dynamic program analysis tool that measures memory consumption, usage of particular instructions, and the duration and frequency of function calls. When profiling is enabled, the system runs in the context of the profiler, which keeps track of all system resources.

Since triaging can’t be done without detailed information, this situation leads to a requirement for dynamically increasing the log level for a specific use case, API or user, and dynamically profiling the slow-running use case. With such a dynamic logging and profiling tool, critical errors and slow-performing use cases can be analysed without applying patches to the production environment or bringing it down. This is very important because it results in less downtime, lower production maintenance cost and a quicker turnaround on fixing critical production errors, and ultimately in better customer satisfaction with the product.
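One way to raise the log level dynamically for a single subsystem, sketched below with java.util.logging, is to switch only the target logger to a verbose level and return a handle that restores the previous threshold when diagnosis is done; the rest of the system keeps its production level throughout. This is a hypothetical helper for illustration, not the actual API of the framework described later:

```java
import java.util.logging.Level;
import java.util.logging.Logger;

public class DynamicLogLevel {
    // Temporarily raise verbosity for one subsystem's logger only.
    // The returned Runnable restores the previous level when invoked.
    public static Runnable boost(Logger target, Level verbose) {
        Level previous = target.getLevel(); // may be null (inherit from parent)
        target.setLevel(verbose);
        return () -> target.setLevel(previous);
    }
}
```

In production the restore handle would typically be triggered by a timer or by an operator action once enough diagnostic data has been collected.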

Dynamic logging is a simple but unique concept: if only one part of the system is misbehaving, inspect only that part and produce more information without impacting the performance of the entire system. With this kind of logging, only a specific use-case execution is slowed, and only for a specific time period or a specific user. I implemented it successfully together with one of my team members.
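The per-user scoping described above can be sketched with a java.util.logging Filter: detailed records pass only when the request on the current thread belongs to the user under investigation, while everyone else stays at the production threshold. The names here (the thread-local user holder in particular) are illustrative assumptions, not the framework's real API:

```java
import java.util.logging.Filter;
import java.util.logging.Level;
import java.util.logging.LogRecord;

public class PerUserLogFilter implements Filter {
    // Illustrative stand-in for a real request context carrying the user id.
    static final ThreadLocal<String> CURRENT_USER = new ThreadLocal<>();

    private final String targetUser;

    public PerUserLogFilter(String targetUser) {
        this.targetUser = targetUser;
    }

    @Override
    public boolean isLoggable(LogRecord record) {
        // Severe errors always pass; verbose records pass only for the
        // one user whose use case is being diagnosed.
        if (record.getLevel().intValue() >= Level.SEVERE.intValue()) return true;
        return targetUser.equals(CURRENT_USER.get());
    }
}
```

Installed on a logger via `logger.setFilter(new PerUserLogFilter("alice"))`, this keeps trace output confined to the affected user's requests.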

Dynamic configurable profiling applies the same idea to profiling: if one part of the system is not performing well, enable profiling in the related Java classes for a specific time period or a specific number of executions of a use case. To implement it we used a library developed internally at Oracle by the profiling tools team. This library was a Java agent that runs as part of your Java application and works on the basis of a configuration file: only if the configuration file exists and the agent is turned on does it come into the picture at runtime, and it impacts performance only for the configured classes. Generating the configuration file for a functionality or use case was also automated. Each use case was associated with a test case, and the framework was designed so that when you chose a use case for profiling, it ran the associated test case to generate the configuration file.
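The configuration-driven, bounded nature of this profiling can be illustrated with a small stand-in: only methods listed in the configuration are timed, and only for a limited number of executions, so the overhead stays confined to the use case under investigation. The real system used the internal Oracle java-agent library described above; this sketch only demonstrates the idea:

```java
import java.util.Set;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;

public class ConfiguredProfiler {
    private final Set<String> configuredMethods; // would come from the generated config file
    private final AtomicInteger remaining;       // profiled executions left

    public ConfiguredProfiler(Set<String> configuredMethods, int maxRuns) {
        this.configuredMethods = configuredMethods;
        this.remaining = new AtomicInteger(maxRuns);
    }

    // Run the body; time it only if the method is configured and the
    // execution budget is not yet exhausted.
    public <T> T call(String methodName, Supplier<T> body) {
        if (!configuredMethods.contains(methodName) || remaining.getAndDecrement() <= 0) {
            return body.get(); // not under profiling: no extra work
        }
        long start = System.nanoTime();
        try {
            return body.get();
        } finally {
            long micros = (System.nanoTime() - start) / 1_000;
            System.out.println(methodName + " took " + micros + " µs");
        }
    }

    public int runsLeft() {
        return Math.max(0, remaining.get());
    }
}
```

A real java agent would instrument the configured methods' bytecode instead of requiring explicit `call` wrappers, but the budget-and-allowlist logic is the same.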

The dynamic logging & profiling (DLP) framework provides a user interface for specifying various logging and profiling options, and it is capable of collecting use-case information dynamically. As a developer, you simply apply an @USECASE annotation to any method of a class; this enables DLP to collect and discover all use cases (monitorable methods) automatically. The DLP user interface shows the collected use cases with their proper package hierarchy. You can enable or disable DLP for a specific duration with different options, and DLP generates logs and profiling data for later analysis.
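The annotation-driven discovery might look roughly like the sketch below: a runtime-retained @USECASE marker plus a scanner that collects annotated methods via reflection. The actual DLP annotation and scanner may differ; the names here are illustrative:

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.lang.reflect.Method;
import java.util.ArrayList;
import java.util.List;

public class UseCaseScanner {
    // Marker for monitorable methods; retained at runtime so reflection sees it.
    @Retention(RetentionPolicy.RUNTIME)
    @Target(ElementType.METHOD)
    public @interface USECASE {
        String value() default "";
    }

    // Collect fully qualified names of all @USECASE-annotated methods.
    public static List<String> discover(Class<?>... classes) {
        List<String> useCases = new ArrayList<>();
        for (Class<?> c : classes) {
            for (Method m : c.getDeclaredMethods()) {
                if (m.isAnnotationPresent(USECASE.class)) {
                    useCases.add(c.getName() + "#" + m.getName());
                }
            }
        }
        return useCases;
    }

    // Example service: only the annotated method is discovered.
    static class OrderService {
        @USECASE("place-order")
        public void placeOrder() { }

        public void helper() { }
    }
}
```

A production framework would scan the classpath rather than take an explicit class list, then render the discovered methods grouped by package in the UI.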

The overall result of implementing the DLP framework and adopting it in a product can be very positive. It enables you to quickly identify and fix the root cause of critical functional issues; quite often the slowness turns out to be poorly written custom plugin code from an implementation partner or customer. The framework can reduce the total turnaround time for resolving customer queries, and it can be a huge asset for the developer community, helping them keep their focus on building new functionality rather than diagnostic patches. By minimizing waste in the process, it helps in building a high-performing team and putting more happy faces on the floor.
