Tuesday, April 24, 2007

The Problem of Application Monitoring

Background

Applications frequently become degraded or otherwise inaccessible to users, and it is hard to find out in any timely manner that problems have occurred. Indeed, sometimes one does not learn that an application has been down until a significant amount of time has passed. This indicates a need for a tool, or set of tools, that can indicate the high-level status of applications. Such tools fall within the set collectively referred to as application monitoring. This paper presents a brief discussion of these methods and tools, and the justification for adopting such technology.

Discussion

Monitoring applications to detect and respond to problems - before an end user is even aware that a problem exists - is a common systems requirement. Most administrators understand the need for application monitoring. Infrastructure teams, in fact, typically monitor the basic health of application servers by keeping an eye on CPU utilization, throughput, memory usage and the like. However, there are many parts to an application server environment, and understanding which metrics to monitor for each of these pieces differentiates those environments that can effectively anticipate production problems from those that might get overwhelmed by them.

When applied in an appropriate context, application monitoring is more than just the data that shows how an application is performing technically. Information such as page hits, frequency, and related statistics contrasted against each other can also show which applications, or portions thereof, have consistently good (or bad) performance. Management reports generated from the collected raw data can provide insight into the volume of users that pass through the application.

There are fundamentally two ways to approach problem solving in a production environment:

1. One is continual data collection using application monitoring tools that typically provide up-to-date performance, health, and status information.

2. The other is trial-and-error theorizing, often limited to whatever data happens to be available from script files and ad hoc log parsing.

Not surprisingly, the latter approach is less efficient, but it's important to understand its other drawbacks as well. Introducing several levels of logging to provide various types of information has long been a popular approach to in-house application monitoring, and for good reason. Logging was a trusted methodology of the client-server era for capturing events happening on remote workstations to help determine application problems. Today, with browsers dominating the thin-client realm, there is little need for collecting data on the end user's workstation; user data is now collected at centralized server locations instead. However, data collection on the server is also problematic, since it rests on the assumption that all possible points of logging have been anticipated and appropriately coded. More often than not, logging is applied inconsistently within an application, often added only as problems are encountered and more information is needed.

In contrast, application monitoring tools offer the ability to quickly add new data - without application code changes - to the information that is already being collected, as the need for different data changes with the ongoing analysis.

While logging worked well in the single user environment, there are some inherent problems with logging in the enterprise application server environment:

• Clustered environments are not conducive to centralized logs. This is a systemic problem for large environments with multiple servers and multiple instances of an application. On top of the problem of how to administer multiple logs, users can bounce between application servers when an application does not use HTTP Session objects. Coordinating and consolidating events for the same user spread across multiple logs is extremely difficult and time consuming.

• Multiple instances of applications and their threads writing to the same set of logs impose a heavy penalty on applications, which essentially spend time synchronized in some logging framework. High-volume Web sites are an environment where synchronization of any kind must be avoided in order to reduce potential bottlenecks that could result in poor response times and, subsequently, a negative end user experience.

• Varying levels of logging require additional attention: when a problem occurs, the next level of logging must be turned on, which means valuable data from the first occurrence of the problem is lost. With problems that are not readily reproducible, it's difficult to predict when logging should be on or off.

• Logs on different machines can have significant timestamp differences, making correlation of data between multiple logs nearly impossible.

• Beyond the impact of actually adding lines of code to an application for monitoring, additional development impacts include:

o Code maintenance: The functionality, logical placement, and data collected will need to be maintained, ideally by developers who understand the impact of the code change that was introduced.

o Inconsistent logging: Different developers may have drastically different interpretations of what data to collect and when to collect it. Such inconsistencies are not easily corrected.

o Developer involvement: Involving developers in problem determination becomes a necessity with log-based approaches, since the developer is usually the best equipped to interpret the data.

• Application monitoring accomplished through coding is rarely reused. Certainly the framework itself can be reused, but probably not the lines of code inserted to capture specific data.

• When logging to a file, the impact on the server's file I/O subsystem is significant. Few things will slow down an enterprise application more than writing to a file. While server caches and other mechanisms can be configured to minimize such a hit, this is still a serious and unavoidable bottleneck, especially in high volume situations where the application is continually sending data to the log.

• While Aspect-Oriented Programming is proving to be a valuable technology for logging, it has yet to be widely embraced by the technical community.

Not surprisingly, it is also common for development teams to try to collect basic performance data using their logging framework, capturing data such as servlet response time or the timings of specific problematic methods in order to better understand how the application performs. This activity falls victim to the same disadvantages mentioned above, in that it assumes all suspected problem points have been correctly identified and instrumented. If new data points are identified, then the application must be modified to accommodate the additional data collection, retested, and then redeployed to the production environment. Naturally, such code also requires continual maintenance for the life of the application.
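To make the drawbacks concrete, the following is a rough sketch of the kind of hand-rolled timing code this approach produces, using java.util.logging; the class and method names are invented for illustration. Every such measurement point has to be coded, maintained, and shipped with the application.

    import java.util.logging.Level;
    import java.util.logging.Logger;

    // Illustrative only: hand-rolled timing buried in application code.
    public class OrderService {

        private static final Logger LOG = Logger.getLogger(OrderService.class.getName());

        public void processOrder(String orderId) {
            long start = System.currentTimeMillis();
            try {
                // ... business logic and back end calls would go here ...
            } finally {
                long elapsed = System.currentTimeMillis() - start;
                // The timing data ends up in a per-server log file, and only for
                // the methods a developer remembered to instrument.
                if (LOG.isLoggable(Level.FINE)) {
                    LOG.fine("processOrder(" + orderId + ") took " + elapsed + " ms");
                }
            }
        }
    }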

The benefits of a proactive, tool-based approach to application monitoring are many:

• No code

This is, by far, the single most valuable benefit of a tool-based approach. Application monitoring tools allow for the seamless and invisible collection of data without writing a single line of code.

• Fewer developer distractions

With application monitoring no longer a focal point, developers can instead concentrate on the logic of the application.

• Reusability

Application monitoring tools are written to generically capture data from any application, resulting in a tremendous amount of reuse built into the tooling itself. Without doing anything extraordinary, an application monitoring tool can capture data for a variety of applications as they come online.

• Reliability

While you should still perform due diligence to ensure that a tool is working properly in your environment, application monitoring tools from major vendors are generally subject to extensive testing and quality assurance for high volume environments.

• Understandable results

Consolidation of data occurs at some central console and the results can be readily understood by a systems administrator. Only when the system administrator has exhausted all resources would developers need to assist in troubleshooting by examining data from a variety of subsystems.

• Cost

While there is the initial expenditure of procuring such a tool, there is also the very real possibility of eventual cost savings - particularly in terms of time.

In general, application monitoring can be divided into the following categories:

1. Fault

This type of monitoring primarily detects major errors related to one or more components. Faults can consist of errors such as the loss of network connectivity, a database server going offline, or the application suffering a Java out-of-memory condition. Faults are important events to detect in the lifetime of an application because they negatively affect the user experience.

2. Performance

Performance monitoring is specifically concerned with detecting less than desirable application performance, such as degraded servlet, database or other back end resource response times. Generally, performance issues arise in an application as the user load increases. Performance problems are important events to detect in the lifetime of an application since they, like Fault events, negatively affect the user experience.

3. Configuration

Configuration monitoring is a safeguard designed to ensure that configuration variables affecting the application and the back end resources remain at their predetermined settings. Incorrect configurations can negatively affect application performance. Large environments with several machines, or environments where administration is performed manually, are candidates for mistakes and inconsistent configurations. Understanding the configuration of the applications and resources is critical for maintaining stability.

4. Security

Security monitoring detects intrusion attempts by unauthorized system users.

Each of these categories can be integrated into daily or weekly management reports for the application. If multiple application monitoring tools are used, the individual subsystems should be capable of either providing or exporting the collected data in different file formats that can then be fed into a reporting tool. Some of the more powerful application monitoring tools can not only monitor a variety of individual subsystems, but can also provide reporting or graphing capabilities.

One of the major side benefits of application monitoring is being able to establish the historical trends of an application. Applications experience generational cycles, where each new version may provide more functionality and/or fixes to previous versions. Proactive application monitoring provides a way to gauge whether changes to the application have affected performance and, more importantly, how. If a fix to a previous issue is showing slower response times, one has to question whether the fix was properly implemented. Likewise, if new features prove to be noticeably slower than others, one can focus the development team on understanding the differences.

Historical trending is achieved by defining a baseline from some predefined performance test and then re-executing that test when new application versions become available. The baseline has to be established on the application at some point in time and can be superseded by a new baseline once performance goals are met. Changes to the application are then measured directly against the baseline. Performance statistics also assist in resolving misconceptions about how an application is (or has been) performing, helping to offset subjective observations not based on fact. When performance data is not collected, subjective observations often lead to erroneous conclusions about application performance.
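As a minimal sketch of how such a baseline comparison might look, the figures below are entirely hypothetical; the point is simply that the new measurement is judged against the stored baseline and an agreed tolerance.

    // Hypothetical baseline comparison: values and the 10% tolerance are illustrative.
    public class BaselineComparison {

        public static void main(String[] args) {
            double baselineMs = 220.0;   // average response time from the baseline performance test
            double currentMs  = 260.0;   // average response time measured for the new version

            double deltaPercent = (currentMs - baselineMs) / baselineMs * 100.0;
            System.out.printf("Response time changed by %.1f%% against the baseline%n", deltaPercent);

            if (deltaPercent > 10.0) {
                System.out.println("Regression: investigate before promoting this version.");
            }
        }
    }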

In the vein of extreme programming, collect only the bare minimum of metrics and thresholds you feel are needed for your application, selecting only those that will provide the data points necessary to assist in the problem determination process. Start with methods that access back end systems and with servlet/JSP response timings. Be prepared to change the set of collected metrics or thresholds as your environment evolves and grows.
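As one possible illustration of that starting point, the sketch below times every servlet/JSP request with a filter and flags those exceeding a threshold. It assumes the standard javax.servlet API; the threshold and logger name are made up, and a monitoring tool would of course collect the same data without any code at all.

    import java.io.IOException;
    import java.util.logging.Logger;
    import javax.servlet.Filter;
    import javax.servlet.FilterChain;
    import javax.servlet.FilterConfig;
    import javax.servlet.ServletException;
    import javax.servlet.ServletRequest;
    import javax.servlet.ServletResponse;

    // Times every request passing through the filter and flags slow ones.
    public class ResponseTimeFilter implements Filter {

        private static final Logger LOG = Logger.getLogger("monitoring.responsetime");
        private static final long THRESHOLD_MS = 2000;   // illustrative threshold

        public void init(FilterConfig config) { }

        public void doFilter(ServletRequest req, ServletResponse res, FilterChain chain)
                throws IOException, ServletException {
            long start = System.currentTimeMillis();
            try {
                chain.doFilter(req, res);
            } finally {
                long elapsed = System.currentTimeMillis() - start;
                if (elapsed > THRESHOLD_MS) {
                    LOG.warning("Slow request: " + elapsed + " ms");
                }
            }
        }

        public void destroy() { }
    }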

There are two main factors measured by end user performance tools: availability and response time.

The first is measured by the uptime of the enterprise applications. Response time measurement looks at the time to complete a specific job - starting at the end users' desktops, through the network, the servers, and back.

End user performance management typically starts with a lightweight agent installed on the end user's computer. The agent records, in real time, the network availability, response time, delays, and occasional failures of requests initiated by the end user, and forwards this data to a central database. The monitoring system then performs trend analysis by comparing the real-time data collected by the agents with historical patterns stored in the database. Reports are then generated to display a number of important measures such as transaction time, delays, and traffic volume.
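The trend-analysis step can be as simple as comparing a new sample against the historical mean and standard deviation for that transaction. The sketch below illustrates the idea; the numbers and the three-sigma rule are assumptions, not a description of any particular product.

    // Compares a real-time sample from an agent with the historical pattern.
    public class TrendAnalysis {

        /** True when the sample deviates markedly from the historical pattern. */
        static boolean isAnomalous(double sampleMs, double meanMs, double stdDevMs) {
            return Math.abs(sampleMs - meanMs) > 3 * stdDevMs;
        }

        public static void main(String[] args) {
            double sampleMs = 1850;   // latest measurement reported by an agent
            double meanMs   = 900;    // historical mean for this transaction
            double stdDevMs = 150;    // historical standard deviation

            if (isAnomalous(sampleMs, meanMs, stdDevMs)) {
                System.out.println("Transaction time deviates from the historical pattern - raise an alert.");
            }
        }
    }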

Response time has always been a key component in the measurement of performance. In this era of networks and rapid deployment of applications, the quest for end-to-end response time has become legendary. Unfortunately, most of today's application performance solutions are more myth than substance.

There are two fundamental approaches to the problem:
- Using various proxies, such as ping
- Observing and measuring application flows.

Most response time metrics turn out to be values derived using simple ping tools. Ping is an invaluable tool, but it has severe limitations as a response time measure. Pinging routers is problematic because of the questionable processing priority given to ping traffic, and consequently the questionable reliability of the measurement. If you are pinging servers, how does response time vary with processing load and other factors? Vendors may claim they have other measures, but most are ping variants, perhaps with a slight improvement in reliability over classic ping. If ping is used, then the derived value must be used properly, as part of a series of measurements over time.
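The sketch below shows what a ping-style probe looks like when used that way, as a series of samples over time. It relies on InetAddress.isReachable, which is only a best-effort reachability check (ICMP or a TCP echo, depending on privileges), and the host name is illustrative; the result is a proxy for response time, not a measurement of it.

    import java.net.InetAddress;

    // Takes a series of reachability samples and reports a rough average.
    public class PingSeries {

        public static void main(String[] args) throws Exception {
            InetAddress host = InetAddress.getByName("appserver.example.com");   // illustrative host
            int samples = 10;
            long total = 0;

            for (int i = 0; i < samples; i++) {
                long start = System.currentTimeMillis();
                boolean reachable = host.isReachable(2000);   // 2 second timeout
                long elapsed = System.currentTimeMillis() - start;
                System.out.println("sample " + i + ": reachable=" + reachable + ", " + elapsed + " ms");
                total += elapsed;
                Thread.sleep(5000);   // space the samples out over time
            }
            System.out.println("average round trip (proxy only): " + (total / samples) + " ms");
        }
    }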
As an alternative response time measurement, some monitoring/probe products apply cadence or pattern-matching heuristics to observed packet streams. These provide a measurement of apparent response time at the application level, but this means deploying multiple, relatively expensive probes to analyze the packet stream. Existing RMON and SNMP standards do not cover this area, so all solutions rely on proprietary software to collect and report on the data. Other concerns are the quality of the heuristics, the scalability of the solution, and the continuity of support across a product's lifetime.

As more and more enterprise applications run in distributed computer networks, the loss of revenue due to downtime or poor performance of those applications is increasing exponentially. This has created the need for diligent management of distributed applications. Management of distributed applications involves accurate monitoring of end-user service level agreements and mapping them to application-level, system-level, and network-level parameters - for example, mapping application-level response time to network-related parameters such as link bandwidth and router throughput using simple queueing models.
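To illustrate what such a simple queueing model looks like, the sketch below uses an M/M/1 approximation of the delay a request sees on a single link, driven by link bandwidth and offered load; all figures are invented for the example.

    // M/M/1 approximation: mean time in system T = 1 / (mu - lambda).
    public class Mm1LinkDelay {

        public static void main(String[] args) {
            double linkBandwidthBps = 10000000.0;   // 10 Mbit/s link
            double avgMessageBits   = 80000.0;      // roughly a 10 KB message
            double arrivalsPerSec   = 90.0;         // offered load in messages per second

            double serviceRate = linkBandwidthBps / avgMessageBits;   // messages the link can carry per second
            double utilization = arrivalsPerSec / serviceRate;

            if (utilization >= 1.0) {
                System.out.println("Link is saturated - delay grows without bound.");
            } else {
                double meanDelaySeconds = 1.0 / (serviceRate - arrivalsPerSec);
                System.out.printf("Utilization %.0f%%, mean link delay %.1f ms%n",
                        utilization * 100, meanDelaySeconds * 1000);
            }
        }
    }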

With more and more enterprises running their mission-critical e-business applications in distributed computing networks, the effective management of these applications is crucial to the success of the business. The loss of revenue due to downtime or poor performance of these distributed applications increases exponentially. Distributed applications operate in a very different environment compared to client/server applications.

In the client/server paradigm, the components of a software application are shared between client and server computers. In a distributed computing environment, an application can have its components running on many computers across an entire network; the distinction between client and server disappears, and a component in a distributed application normally acts as both client and server. A distributed application can be of any type that runs in a distributed environment: it can be a single component such as a web page, a database, a reusable component, a URL, a UNIX process, a Java class or EJB, and so on. More generally, a distributed application is a combination of objects and processes, with dependent relationships, that communicate with each other in order to provide a service to end users.

Monitoring solely from the client's side is another class of techniques. In contrast to the methods mentioned so far, it is possible to measure the actual time an application needs to complete a transaction, i.e. it is metered from the user's perspective. Nevertheless, this class of techniques still suffers from one general problem: it can detect an application's malfunction the moment it happens, but it does not help in finding the root cause of the problem. Therefore, in general, this class of techniques is only useful for verifying fulfillment of SLAs from a customer's point of view; additional techniques have to be used for further analysis in order to determine the cause of a QoS problem. There are two basic methods for monitoring performance from a client's perspective: synthetic transactions and GUI-based solutions.

The synthetic transaction method uses simulated transactions to measure the response time of an application server and to verify the received responses by comparing them with previously recorded reference transactions. Several simulator agents, acting as clients in the network, send requests to the application server of interest and measure the time needed to complete a transaction. If the response time exceeds a configurable threshold or the received server response is incorrect in some way, the agents usually inform the manager by generating events. As only synthetic transactions are monitored, and not real transactions initiated by actual users, this technique is only useful for taking a snapshot of a server's availability, not for verifying the fulfillment of service level agreements. To get measurement data close to the actual user experience, the interval between simulated transactions has to be reduced to a minimum; as a consequence, the application service can experience serious performance degradation. Further problems arise from agent deployment in large networks.
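A minimal sketch of such a simulator agent is shown below: it issues one scripted HTTP request, times it, and checks the response against a previously recorded reference fragment. The URL, threshold, and reference string are hypothetical, and a real agent would send an event to its manager rather than print to the console.

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.net.HttpURLConnection;
    import java.net.URL;

    // One synthetic transaction: request, timing, and reference comparison.
    public class SyntheticProbe {

        public static void main(String[] args) throws Exception {
            URL url = new URL("http://appserver.example.com/store/login");   // illustrative URL
            long thresholdMs = 3000;
            String expectedFragment = "Welcome";   // from a previously recorded reference response

            long start = System.currentTimeMillis();
            HttpURLConnection conn = (HttpURLConnection) url.openConnection();
            BufferedReader in = new BufferedReader(new InputStreamReader(conn.getInputStream()));
            StringBuilder body = new StringBuilder();
            String line;
            while ((line = in.readLine()) != null) {
                body.append(line);
            }
            in.close();
            long elapsed = System.currentTimeMillis() - start;

            if (elapsed > thresholdMs || body.indexOf(expectedFragment) < 0) {
                System.out.println("EVENT: synthetic transaction slow or incorrect (" + elapsed + " ms)");
            } else {
                System.out.println("Synthetic transaction OK in " + elapsed + " ms");
            }
        }
    }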

The GUI-based approach meters actual user transactions; to avoid the need to access the client application's source code, a new approach was recently developed. Since every user request both starts and ends with using or changing a GUI element at the client side (e.g. clicking a web link and displaying the appropriate web page afterwards), simply observing GUI events delivers the needed information about the start and end points of user transactions. A software agent installed on client-side devices gathers the transaction data of interest from a user's point of view. The advantages of this technique are that the actual transaction duration is measured and that it can be applied to every application service client; furthermore, only very little performance impact is caused on the monitored application. However, there seem to be two major problems. First, mapping GUI events to user transactions is a difficult task for non-standard applications and therefore requires additional effort by the administrator. Second, few agents use this technique.
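For a Swing client, the idea can be sketched very roughly with a global AWT event listener that timestamps a mouse click and treats the next paint event as the end of the user transaction. This is only an illustration of the principle; real agents use far more elaborate heuristics to map GUI events to transactions.

    import java.awt.AWTEvent;
    import java.awt.Toolkit;
    import java.awt.event.AWTEventListener;
    import java.awt.event.MouseEvent;
    import java.awt.event.PaintEvent;

    // Crude GUI-event timing: a click starts a transaction, the next paint ends it.
    public class GuiTransactionTimer {

        private static long clickTime = -1;

        public static void install() {
            Toolkit.getDefaultToolkit().addAWTEventListener(new AWTEventListener() {
                public void eventDispatched(AWTEvent event) {
                    if (event.getID() == MouseEvent.MOUSE_CLICKED) {
                        clickTime = System.currentTimeMillis();    // transaction starts with the click
                    } else if (event instanceof PaintEvent && clickTime >= 0) {
                        long elapsed = System.currentTimeMillis() - clickTime;
                        System.out.println("Apparent user transaction: " + elapsed + " ms");
                        clickTime = -1;                            // transaction ends with the repaint
                    }
                }
            }, AWTEvent.MOUSE_EVENT_MASK | AWTEvent.PAINT_EVENT_MASK);
        }
    }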

As mentioned before, client-based monitoring cannot identify the reason for performance degradation or malfunction of an application. Therefore, solutions that monitor from both the client side and the server side are necessary. As details about the application and problems within the application cannot be gathered externally, these approaches rely on information supplied by the application itself. Our studies have shown two basic classes that allow application-wide monitoring: application instrumentation and application description.

Application instrumentation means inserting specialized management code directly into the application's code. The required information is sent to management systems through some kind of well-defined interface. This approach can deliver all the service-oriented information needed by an administrator: the actual status of the application and the actual duration of transactions are measured, and any level of detail can be achieved. Subtransactions within the user transactions can be identified and measured. However, application instrumentation is not very commonly used today, mainly because of the complexity and the additional effort imposed on the application developer. The developer has to insert management code manually when building the application, and subtransactions have to be correlated manually to higher-level transactions. As the source code is needed to perform instrumentation, it has to take place during development.
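The sketch below shows what instrumentation looks like from the developer's perspective: transaction boundaries are marked in the source and a subtransaction is correlated with its parent. The small MonitoringAgent class is a stand-in invented for this example; it is not the ARM or AIC API.

    // Hypothetical instrumentation: the agent class below is a stand-in, not a real API.
    public class CheckoutService {

        /** Stand-in for a management library; a real agent would report to a manager. */
        static class MonitoringAgent {
            long start(String name, long parentHandle) {
                return System.nanoTime();
            }
            void stop(String name, long handle) {
                long elapsedMs = (System.nanoTime() - handle) / 1000000;
                System.out.println(name + " took " + elapsedMs + " ms");
            }
        }

        private final MonitoringAgent monitor = new MonitoringAgent();

        public void checkout(String cartId) {
            long txn = monitor.start("checkout", 0);
            try {
                long sub = monitor.start("chargeCreditCard", txn);   // subtransaction correlated to its parent
                // ... call to the payment back end would go here ...
                monitor.stop("chargeCreditCard", sub);
            } finally {
                monitor.stop("checkout", txn);
            }
        }
    }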

Examples of approaches using application instrumentation are the Application Response Measurement (ARM) API, jointly developed by HP and Tivoli, and the Application Instrumentation and Control (AIC) API developed by Computer Associates. Both approaches have recently been standardized by the Open Group. ARM defines a library that is to be called whenever a transaction starts or stops; subtransactions can be correlated using so-called correlators, so the duration of a transaction and all subordinate transactions can be measured. AIC, in contrast, was not explicitly developed for performance measurement but can be used in this area as well. It defines an application library to provide management objects that can be queried transparently using a client library; additionally, a generic management function can be called through the library, and thresholds of certain managed objects can be monitored regularly. Both ARM and AIC suffer from the problems mentioned above and thus are not in widespread use today.

As most of the applications in use today deliver status information in some way but are not explicitly instrumented for management, application description techniques can be used. As opposed to the instrumentation approach, no well-defined interface for the provisioning of management information exists; the description therefore specifies where to find the relevant information and how to interpret it. Examples include scanning log files or capturing status events generated by the application. The major advantage of application description techniques is that they can be applied to legacy applications without requiring access to the source code: the description can be produced by a third party after application development, although the more reasonable approach is again for the developer to provide it. Application description faces two major problems. First, the information available is typically not easy to map to the information needed by the administrator; especially in the area of performance management, typically only little information is available. Second, monitors are needed to extract the information from the application, and since only very little information can be gathered by standard monitors, specialized monitors must be developed for every application. The most prominent representative of application description suited for performance monitoring is the Application Management Specification (AMS); most other approaches, like the CIM Application Schema, mainly focus on configuration management. An example of a tool making use of application description is Tivoli Business System Manager, which reads AMS-based Application Description Files (ADF) to learn about the application or business system to be managed.
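As a sketch of the log-scanning kind of monitor described above, the example below reads an application log and raises an event when a configured pattern appears. The log path and pattern are illustrative; in practice a description file (such as an AMS-based ADF) would tell the monitor where to look and what to match.

    import java.io.BufferedReader;
    import java.io.FileReader;

    // Scans an application log for a configured error pattern.
    public class LogScanMonitor {

        public static void main(String[] args) throws Exception {
            String logFile = "/opt/app/logs/application.log";   // illustrative path
            String pattern = "OutOfMemoryError";                 // illustrative pattern

            BufferedReader reader = new BufferedReader(new FileReader(logFile));
            String line;
            int lineNumber = 0;
            while ((line = reader.readLine()) != null) {
                lineNumber++;
                if (line.contains(pattern)) {
                    System.out.println("EVENT: '" + pattern + "' found at line " + lineNumber);
                }
            }
            reader.close();
        }
    }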

Conclusion

Identifying the root cause of an application performance problem is not a trivial task. Software code, system architecture, server hardware, and network configuration can all impact an application's performance. Application monitoring tools are developed to provide specific information on how the system as a whole is behaving.

Application performance tools determine client and server performance, network bandwidth, and latency. They provide a map of the conversation in which suspect code can be evaluated. The most sophisticated tools in this area have extensive drill-down capabilities and can provide information gathered from both client-to-server and server-to-server conversations. Network administrators as well as developers may use these tools. Slow user response time can be costly: even though it may be a small part of the user's day, it can really add up when it impacts hundreds of users.

Monitoring a variety of application metrics in production can help you understand the status of the components within an application server environment, from both a current and historical perspective. As more back end resources and applications are added to the mix, you need only to instruct the application monitoring tool to collect additional metrics. With judicious planning and the right set of data, proactive monitoring can help you quickly correct negative application performance, if not help you avoid it altogether.

Proactive monitoring provides the ability to detect problems as they happen and fix them before anyone notices. If problems are going to happen, it's better to find them before the customer does.
