
- Learn the uses, meanings of, and concepts underlying common mathematical and statistical principles as they apply to performance test analysis and reporting.

Even though there is a need to understand many mathematical and statistical concepts, many software developers, testers, and managers either do not have strong backgrounds in or do not enjoy mathematics and statistics. This leads to significant misrepresentations and misinterpretation of performance-testing results. The information presented in this article is not intended to replace formal training in these areas, but rather to provide common language and commonsense explanations for mathematical and statistical operations that are valuable to understanding performance testing.

- Use the “Exemplar Data Sets” section to gain an understanding of the exemplars, which are used to illustrate the key mathematical principles explained throughout the chapter.
- Use the remaining sections to learn about key mathematical principles that will help you to understand and present meaningful performance testing reports.

- Data Set A
- Data Set B
- Data Set C

Data Set A: 100 total data points, distributed as follows:

- 5 data points have a value of 1.
- 10 data points have a value of 2.
- 20 data points have a value of 3.
- 30 data points have a value of 4.
- 20 data points have a value of 5.
- 10 data points have a value of 6.
- 5 data points have a value of 7.

Data Set B: 100 total data points, distributed as follows:

- 80 data points have a value of 1.
- 20 data points have a value of 16.

Data Set C: 100 total data points, distributed as follows:

- 11 data points have a value of 0.
- 10 data points have a value of 1.
- 11 data points have a value of 2.
- 13 data points have a value of 3.
- 11 data points have a value of 4.
- 11 data points have a value of 5.
- 11 data points have a value of 6.
- 12 data points have a value of 7.
- 10 data points have a value of 8.
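For readers who want to experiment with the exemplars, the following is a minimal sketch that rebuilds the three data sets from the frequency lists above and prints a few basic summary statistics. The names `DATA_SET_A`, `DATA_SET_B`, `DATA_SET_C`, and `expand` are illustrative choices, not part of the original material.

```python
from statistics import mean, median, mode

# Frequency tables copied from the exemplar lists: {value: number of data points}.
DATA_SET_A = {1: 5, 2: 10, 3: 20, 4: 30, 5: 20, 6: 10, 7: 5}
DATA_SET_B = {1: 80, 16: 20}
DATA_SET_C = {0: 11, 1: 10, 2: 11, 3: 13, 4: 11, 5: 11, 6: 11, 7: 12, 8: 10}


def expand(freq):
    """Turn a {value: count} table into a sorted list of individual measurements."""
    return sorted(value for value, count in freq.items() for _ in range(count))


for name, freq in (("A", DATA_SET_A), ("B", DATA_SET_B), ("C", DATA_SET_C)):
    data = expand(freq)
    # Print size, mean, median, mode, and maximum for each exemplar.
    print(name, len(data), mean(data), median(data), mode(data), max(data))
```

All three sets contain 100 points and share a mean of 4, which is what makes them useful for showing how differently shaped data can hide behind the same average.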

It is important to note that percentile statistics can only stand alone when used to represent data that is uniformly or normally distributed with an acceptable number of outliers (see “Statistical Outliers” below). To illustrate this point, consider the exemplar data sets. The 95th percentile of Data Set B is 16 seconds. Obviously, this does not give the impression of achieving the 5-second response time goal. The alternative can be just as misleading, because the 80th percentile value of Data Set B is 1 second. With a response time goal of 5 seconds, it is likely unacceptable to have any response times of 16 seconds, so in this case neither of these percentile values represents the data in a manner that is useful for summarizing response time.

Data Set A is a normally distributed data set that has a 95th percentile value of 6 seconds, an 85th percentile value of 5 seconds, and a maximum value of 7 seconds. In this case, reporting either the 85th or the 95th percentile value represents the data in a way that matches the assumptions a stakeholder is likely to make about it.
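The percentile figures quoted for Data Sets A and B can be reproduced with a short sketch. It assumes the nearest-rank definition of a percentile (the smallest value with at least p percent of the data at or below it); other definitions, such as those that interpolate between neighboring values, can give different answers for data like Data Set B. The function name `percentile` is illustrative.

```python
import math

# Exemplar data sets expanded from their frequency lists.
data_set_a = [1] * 5 + [2] * 10 + [3] * 20 + [4] * 30 + [5] * 20 + [6] * 10 + [7] * 5
data_set_b = [1] * 80 + [16] * 20


def percentile(values, p):
    """Nearest-rank percentile (an assumed definition) for an integer p from 1 to 100."""
    ordered = sorted(values)
    rank = math.ceil(p * len(ordered) / 100)  # 1-based rank in the sorted data
    return ordered[max(rank, 1) - 1]


print("A: 85th =", percentile(data_set_a, 85), " 95th =", percentile(data_set_a, 95))
print("B: 80th =", percentile(data_set_b, 80), " 95th =", percentile(data_set_b, 95))
```

For Data Set B the reported value jumps from 1 to 16 between the 80th and 81st percentiles, which is why no single percentile summarizes that data well.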

A common rule in this case is: “Data with a standard deviation greater than half of its mean should be treated as suspect. If the data is accurate, the phenomenon the data represents is not displaying a normal distribution pattern.” All three exemplars have a mean of 4, so the threshold is 2. Applying this rule, Data Set A (standard deviation of about 1.4) is likely to be a reasonable example of a normal distribution; Data Set B (about 6.0) clearly is not; and Data Set C (about 2.5) exceeds the threshold only modestly, although its nearly uniform shape shows that it is not a normal distribution either.
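The rule of thumb can be checked directly against the exemplars. The sketch below assumes the population standard deviation (`statistics.pstdev`); using the sample standard deviation instead changes the figures only slightly for 100 data points. The helper name `is_suspect` is illustrative.

```python
from statistics import mean, pstdev

# Exemplar data sets expanded from their frequency lists.
data_sets = {
    "A": [1] * 5 + [2] * 10 + [3] * 20 + [4] * 30 + [5] * 20 + [6] * 10 + [7] * 5,
    "B": [1] * 80 + [16] * 20,
    "C": [0] * 11 + [1] * 10 + [2] * 11 + [3] * 13 + [4] * 11
         + [5] * 11 + [6] * 11 + [7] * 12 + [8] * 10,
}


def is_suspect(values):
    """Rule of thumb: a standard deviation above half the mean marks the data as suspect."""
    return pstdev(values) > mean(values) / 2


for name, values in data_sets.items():
    # Show the standard deviation, the half-mean threshold, and the verdict.
    print(name, round(pstdev(values), 2), mean(values) / 2, is_suspect(values))
```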

There is no way to avoid arbitrariness in the final decision as to what level of significance will be treated as really ‘significant.’ That is, the selection of some level of significance, up to which the results will be rejected as invalid, is arbitrary.

Typically, it is fairly easy to add iterations to performance tests to increase the total number of measurements collected; the best way to ensure statistical significance is simply to collect additional data if there is any doubt about whether or not the collected data represents reality. Whenever possible, ensure that you obtain a sample size of at least 100 measurements from at least two independent tests.

There is no strict rule for deciding which results are statistically similar without complex equations that call for volumes of data that commercially driven software projects rarely have the time or resources to collect. However, the following is a reasonable approach if there is doubt about the significance or reliability of data after evaluating two test executions where the data was expected to be similar. Compare results from at least five test executions and apply the rules of thumb below to determine whether or not test results are similar enough to be considered reliable:

- If more than 20 percent (or one out of five) of the test-execution results appear not to be similar to the others, something is generally wrong with the test environment, the application, or the test itself.
- If a 90th percentile value for any test execution is greater than the maximum or less than the minimum value for any of the other test executions, that data set is probably not statistically similar.
- If measurements from a test are noticeably higher or lower, when charted side-by-side, than the results of the other test executions, it is probably not statistically similar.

- If one data set for a particular item (e.g., the response time for a single page) in a test is noticeably higher or lower, but the results for the data sets of the remaining items appear similar, the test itself is probably statistically similar (even though it is probably worth the time to investigate the reasons for the difference of the one dissimilar data set).
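The first two rules of thumb lend themselves to a simple automated check. The following is a sketch only: the response times in `runs` are invented for illustration, the percentile uses the nearest-rank definition (an assumption), and "more than 20 percent" is read strictly, so one dissimilar run out of five does not by itself trigger the first rule.

```python
import math


def percentile(values, p):
    """Nearest-rank percentile (an assumed definition) for an integer p from 1 to 100."""
    ordered = sorted(values)
    rank = math.ceil(p * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def dissimilar_runs(runs):
    """Indexes of runs whose 90th percentile falls outside the min-max range of any other run."""
    flagged = []
    for i, run in enumerate(runs):
        p90 = percentile(run, 90)
        others = [other for j, other in enumerate(runs) if j != i]
        if any(p90 > max(other) or p90 < min(other) for other in others):
            flagged.append(i)
    return flagged


def environment_suspect(runs):
    """True when more than 20 percent of the runs look dissimilar to the rest."""
    return len(dissimilar_runs(runs)) / len(runs) > 0.20


# Hypothetical response times (seconds) from five executions of the same test.
runs = [
    [1, 2, 2, 3, 3, 3, 4, 4, 5, 6],
    [1, 2, 3, 3, 3, 4, 4, 4, 5, 6],
    [2, 2, 2, 3, 3, 4, 4, 5, 5, 6],
    [1, 2, 2, 3, 3, 3, 4, 5, 5, 7],
    [1, 5, 6, 7, 8, 8, 9, 9, 10, 12],
]

print("Dissimilar runs:", dissimilar_runs(runs))
print("Environment suspect:", environment_suspect(runs))
```

A check like this is only a first filter; charting the runs side by side, as the third rule suggests, remains the more reliable way to spot a run that does not belong.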

For the purposes of this explanation, a more applicable definition of an outlier, from StatSoft, Inc. (http://www.statsoftinc.com), is the following:

Outliers are atypical, infrequent observations: data points which do not appear to follow the distribution of the rest of the sample. These may represent consistent but rare traits, or be the result of measurement errors or other anomalies which should not be modeled.

Note that this (or any other) description of outliers only applies to data that is deemed to be a statistically significant sample of measurements. Without a statistically significant sample, there is no generally acceptable approach to determining the difference between an outlier and a representative measurement.

Using this description, results graphs can be used to look for evidence of outliers: occasional data points that just don’t seem to belong. A reasonable approach to determining if any apparent outliers are truly atypical and infrequent is to re-execute the tests and then compare the results to the first set. If the majority of the measurements are the same, except for the potential outliers, the results are likely to contain genuine outliers that can be disregarded. However, if the results show similar potential outliers, these are probably valid measurements that deserve consideration.

After identifying that a data set appears to contain outliers, the next question is: how many outliers can be dismissed as “atypical, infrequent observations”?

There is no set number of outliers that can be unilaterally dismissed, but rather a maximum percentage of the total number of observations. Applying the spirit of the two definitions above, a reasonable conclusion would be that up to 1 percent of the total values for a particular measurement that are outside of three standard deviations from the mean are significantly atypical and infrequent enough to be considered outliers.

In summary, in practice for commercially driven software development, it is generally acceptable to say that values representing less than 1 percent of all the measurements for a particular item that are at least three standard deviations off the mean are candidates for omission in results analysis if (and only if) identical values are not found in previous or subsequent tests. To express the same concept in a more colloquial way: obviously rare and strange data points that can’t immediately be explained, account for a very small part of the results, and are not identical to any results from other tests are probably outliers.
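The summary rule can be sketched as a small filter. Everything named here is illustrative: `outlier_candidates` and its parameters are hypothetical, the data is invented, and the population standard deviation is an assumption. Matching "identical values" exactly is also a simplification; real response times would usually need a tolerance.

```python
from statistics import mean, pstdev


def outlier_candidates(values, other_runs=(), max_fraction=0.01, n_sigma=3):
    """Values at least n_sigma standard deviations from the mean that make up
    less than max_fraction of the measurements and do not appear in other runs."""
    mu, sigma = mean(values), pstdev(values)
    far = [v for v in values if abs(v - mu) >= n_sigma * sigma]
    if len(far) >= max_fraction * len(values):
        return []  # too frequent to be dismissed as atypical
    seen_elsewhere = {v for run in other_runs for v in run}
    return [v for v in far if v not in seen_elsewhere]


# Hypothetical measurements: 300 response times with a single extreme value.
values = [2] * 150 + [3] * 149 + [60]

print(outlier_candidates(values, other_runs=[[2, 3, 2, 3]]))  # 60 not seen elsewhere
print(outlier_candidates(values, other_runs=[[2, 3, 60]]))    # 60 recurs in another test
```

In the second call the extreme value also appears in another test, so it is no longer a candidate for omission and should be treated as a valid measurement.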

A note of caution: identifying a data point as an outlier and excluding it from results summaries does not imply ignoring the data point. Excluded outliers should be tracked in some manner appropriate to the project context in order to determine, as more tests are conducted, if a pattern of concern is identified in what by all indications are outliers for individual tests.

Because stakeholders do frequently ask for some indication of the presumed accuracy of test results (for example, “What is the confidence interval for these results?”), another commonsense approach must be employed.

When performance testing, the answer to that question is directly related to the accuracy of the model tested. Since in many cases the accuracy of the model cannot be reasonably determined until after the software is released into production, this is not a particularly useful dependency. However, there is a way to demonstrate a confidence interval in the results.

By testing a variety of scenarios, including what the team determines to be “best,” “worst,” and “expected” cases in terms of the measurements being collected, a graphical depiction of a confidence interval can be created, similar to the one below.

In this graph, a dashed line represents the performance goal, and the three curves represent the results from the worst-case (most performance-intensive), best-case (least performance-intensive), and expected-case user community models. As one would expect, the curve from the expected case falls between the best- and worst-case curves. Observing where these curves cross the goal line, one can see how many users can access the system in each case while still meeting the stated performance goal. If the team is 95-percent confident (by their own estimation) that the best- and worst-case user community models are truly best- and worst-case, this chart can be read as follows: the tests show, with 95-percent confidence, that between 100 and 200 users can access the system while experiencing acceptable performance.
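The same reading of the chart can be expressed in a few lines of code. The numbers below are entirely hypothetical, chosen only so that the result matches the 100-to-200-user range described above; the 5-second goal, the curve data, and the function name `max_users_meeting_goal` are all assumptions made for illustration.

```python
GOAL_SECONDS = 5.0  # hypothetical performance goal (the goal line on the chart)

# Hypothetical (user load, response time in seconds) results per user community model.
curves = {
    "best": [(50, 1.0), (100, 1.5), (150, 2.5), (200, 4.8), (250, 9.0)],
    "expected": [(50, 1.2), (100, 2.5), (150, 4.6), (200, 8.0), (250, 15.0)],
    "worst": [(50, 2.0), (100, 4.9), (150, 9.0), (200, 16.0), (250, 30.0)],
}


def max_users_meeting_goal(curve, goal):
    """Largest tested user load reached before the response time first exceeds the goal."""
    supported = 0
    for users, seconds in sorted(curve):
        if seconds > goal:
            break
        supported = users
    return supported


low = max_users_meeting_goal(curves["worst"], GOAL_SECONDS)
high = max_users_meeting_goal(curves["best"], GOAL_SECONDS)
print(f"Between {low} and {high} users can be supported within the goal")
```

The width of the resulting range reflects the team's confidence in the best- and worst-case models, not a statistical property computed from the measurements themselves.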

Although a confidence interval of between 100 and 200 users might seem quite large, it is important to note that without empirical data representing the actual production usage, it is unreasonable to expect higher confidence in results than there is in the models that generate those results. The best that one can do is to be 100-percent confident that the test results accurately represent the model being tested.

Last edited Aug 23, 2007 at 6:51 PM by prashantbansode, version 3