# Evaluation tools

## Summary of suggested tools for comparison and evaluation

**Diversity between classifications**

The following indices, each measuring the similarity between two partitions (classifications) of the same set of observations, have been calculated: Rand Index [RI], Adjusted Rand Index [ARI], Jaccard Index [JI], Mutual Information [MI] and Normalized Mutual Information [NMI].
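As an illustration, the five indices can be computed from the contingency table of two label sequences. The sketch below uses the textbook pair-counting definitions of RI, ARI and JI, and normalizes MI by the geometric mean of the two entropies; the function name `partition_similarity` and the choice of normalization are assumptions made here, not necessarily what was used to produce the project results.

```python
from collections import Counter
from math import comb, log, sqrt


def partition_similarity(a, b):
    """Compare two partitions given as equal-length sequences of type labels.

    Returns a dict with RI, ARI, JI, MI and NMI. Degenerate inputs (for
    example, both partitions consisting of a single type) lead to a
    division by zero and are not handled in this sketch.
    """
    n = len(a)
    joint = Counter(zip(a, b))  # contingency table n_ij
    rows = Counter(a)           # type sizes in partition a
    cols = Counter(b)           # type sizes in partition b

    # Pair counts: pairs of observations grouped together in both
    # partitions, in a, and in b.
    same_both = sum(comb(v, 2) for v in joint.values())
    same_a = sum(comb(v, 2) for v in rows.values())
    same_b = sum(comb(v, 2) for v in cols.values())
    total = comb(n, 2)

    ri = (total + 2 * same_both - same_a - same_b) / total
    expected = same_a * same_b / total
    ari = (same_both - expected) / (0.5 * (same_a + same_b) - expected)
    ji = same_both / (same_a + same_b - same_both)

    # Mutual information and entropies (natural logarithm).
    mi = sum(v / n * log(v * n / (rows[i] * cols[j]))
             for (i, j), v in joint.items())
    h_a = -sum(v / n * log(v / n) for v in rows.values())
    h_b = -sum(v / n * log(v / n) for v in cols.values())
    nmi = mi / sqrt(h_a * h_b)

    return {"RI": ri, "ARI": ari, "JI": ji, "MI": mi, "NMI": nmi}
```

Two identical partitions give RI = ARI = JI = NMI = 1, whereas two statistically independent partitions give MI = 0 and an ARI close to or below zero.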

**Separability and within-type variability of classifications**

Evaluation criteria are: the Pattern Correlation Ratio (**PCR**, expressed as percentages), the Within-Type Standard Deviation (**WSD**), the Explained Variation (**EV**, expressed as percentages), the Pseudo-F Statistic (**PF**) and the Silhouette Index (**SIL**).

The criteria are briefly described in Table 1.

Evaluation criteria have been estimated based on the following variables: mean sea level pressure (**MSLP**), 2m temperature (**2mT**), large scale precipitation (**LSP**), convective precipitation (**CP**) and precipitation sum, LSP+CP (**PRCP**).

**Table 1:** Selected criteria for evaluating circulation classifications.

http://geo23.geo.uni-augsburg.de/cost733_WG3/evaluation_criteria_table.png
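To make two of the criteria concrete, the sketch below computes the Explained Variation and the Pseudo-F Statistic from the usual sums of squares: EV = 1 - WSS/TSS and PF = [(TSS - WSS)/(k - 1)] / [WSS/(n - k)], where TSS and WSS are the total and within-type sums of squares, n is the number of observations and k the number of types. These are the commonly used definitions and are assumed here; the table above remains the authoritative description. The function name `ev_and_pseudo_f` is hypothetical.

```python
def ev_and_pseudo_f(rows, labels):
    """Return (EV in percent, pseudo-F) for a classification.

    rows   : list of observations, each a list of floats (e.g. grid-point
             values of MSLP for one day)
    labels : list of type labels, one per observation

    A classification with zero within-type variance makes the pseudo-F
    undefined (division by zero); this is not handled in this sketch.
    """
    n = len(rows)
    dim = len(rows[0])

    def centroid(points):
        return [sum(p[d] for p in points) / len(points) for d in range(dim)]

    def sum_sq(points, centre):
        return sum((p[d] - centre[d]) ** 2 for p in points for d in range(dim))

    # Group observations by circulation type.
    groups = {}
    for row, lab in zip(rows, labels):
        groups.setdefault(lab, []).append(row)
    k = len(groups)

    tss = sum_sq(rows, centroid(rows))                          # total
    wss = sum(sum_sq(g, centroid(g)) for g in groups.values())  # within types

    ev = 100.0 * (1.0 - wss / tss)
    pf = ((tss - wss) / (k - 1)) / (wss / (n - k))
    return ev, pf
```

For example, the four one-dimensional observations 0, 2, 10, 12 split into the types {0, 2} and {10, 12} have TSS = 104 and WSS = 4, so most of the variance is explained by the classification.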

**Indicators used to assess links between the occurrence of a phenomenon (e.g. flood) and circulation patterns.**

- Indicator 1: frequency anomaly

Measure of the contribution of pattern type i to the occurrence of floods: the relative number of days with pattern i in the N=N* days preceding the flood, compared with the purely random frequency of occurrence of pattern i during the season considered. The measure has been modified so that negative values indicate that pattern i occurs less often than usual during a flood, and positive values that it occurs more often than usual. Significance is assessed using the chi-squared (χ²) test.
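One possible reading of this indicator is the difference between the relative frequency of each pattern in the pre-flood windows and its relative frequency in the whole series. The sketch below implements that reading only; the function name `frequency_anomaly`, the use of a simple difference (rather than, say, a ratio) and the window definition (the N* days before each flood day, excluding the flood day itself) are assumptions, and the significance test is left out.

```python
def frequency_anomaly(types, flood_days, n_star):
    """Frequency anomaly of each pattern in the days preceding floods.

    types      : list of daily pattern labels for the season considered
    flood_days : indices into `types` of the days on which a flood occurred
    n_star     : number of days before each flood to inspect (N*)

    Returns {pattern: anomaly}. Negative values mean the pattern occurs
    less often before floods than in the series as a whole. Windows of
    nearby floods may overlap, in which case days are counted twice.
    """
    window = [types[d - lag]
              for d in flood_days
              for lag in range(1, n_star + 1)
              if d - lag >= 0]
    return {p: window.count(p) / len(window) - types.count(p) / len(types)
            for p in sorted(set(types))}
```

Usage: `frequency_anomaly(daily_types, flood_indices, 3)` returns one anomaly per pattern, where `daily_types` and `flood_indices` are hypothetical input series.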

- Indicator 2: persistence measure

Conditional probability of finding at least k days out of N* with pattern or pattern group i, given that a flood occurred on day zero.

Significance is assessed by comparing this conditional probability with the binomial probability of at least k days out of N* with pattern i, using historical frequencies of occurrence.
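The two probabilities being compared can be sketched as follows. The binomial reference is the upper tail P(X ≥ k) for X ~ Binomial(N*, p), with p the historical frequency of the pattern. The empirical conditional probability is estimated here over the N* days before each flood day; that window definition and the function names are assumptions made for illustration.

```python
from math import comb


def binomial_tail(k, n_star, p):
    """Probability of at least k successes in n_star independent trials."""
    return sum(comb(n_star, j) * p ** j * (1 - p) ** (n_star - j)
               for j in range(k, n_star + 1))


def persistence_probability(types, flood_days, pattern, k, n_star):
    """Share of floods preceded by at least k days of `pattern`.

    Only floods with a complete window of n_star preceding days are used.
    """
    usable = [d for d in flood_days if d - n_star >= 0]
    hits = sum(
        1 for d in usable
        if sum(types[d - lag] == pattern for lag in range(1, n_star + 1)) >= k
    )
    return hits / len(usable)
```

A conditional probability well above `binomial_tail(k, n_star, p)` suggests that the pattern persists before floods more than chance alone would produce. Note that the binomial reference treats days as independent, which ignores the day-to-day persistence of circulation patterns.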

- Indicator 3: Brier Skill Score (BSS)

The Brier skill score is widely used to evaluate probability forecasts, but can also be adopted for the evaluation of classifications, where it takes a particularly simple form (Schiemann and Frei, 2009):

$$\mathrm{BSS} = \frac{\frac{1}{N}\sum_{i=1}^{I} N_i \,(y_i - \bar{o})^2}{\bar{o}\,(1 - \bar{o})}$$

Here, *N* is the total number of observations (e.g., days), *N_i* is the number of observations (days) with circulation type *i*, *y_i* is the relative frequency of an event (e.g., the exceedance of a threshold by some variable) during circulation type *i*, *ō* (o bar) is the climatological (unconditional) event frequency, and *I* is the total number of types.
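A minimal sketch of this score, using the symbols defined above and assuming the simple form in question is the frequency-weighted squared deviation of the conditional event frequencies from ō, divided by ō(1 - ō). The function name `brier_skill_score` and the input layout are hypothetical.

```python
def brier_skill_score(types, events):
    """Brier skill score of a classification with respect to a binary event.

    types  : list of circulation type labels, one per observation
    events : list of booleans (or 0/1), True where the event occurred

    The score is undefined if the event never or always occurs
    (division by zero); this is not handled in this sketch.
    """
    n = len(types)                       # N
    o_bar = sum(events) / n              # climatological event frequency

    counts, hits = {}, {}
    for t, e in zip(types, events):
        counts[t] = counts.get(t, 0) + 1          # N_i
        hits[t] = hits.get(t, 0) + (1 if e else 0)

    # Weighted squared deviation of the conditional frequencies y_i.
    resolution = sum(n_i * (hits[t] / n_i - o_bar) ** 2
                     for t, n_i in counts.items()) / n
    return resolution / (o_bar * (1 - o_bar))
```

The score is 0 when every type has the climatological event frequency and 1 when each type is associated with either no events or only events.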

**Dispersion between classifications**

Gini coefficient

The Gini coefficient method [Gini 1921], based on the Lorenz curve [Lorenz 1905], can be applied to compare circulation type classifications (CTCs). To calculate the Gini coefficient G for a classification, the probability p_i = m_i/n_i of occurrence of days with some characteristic (e.g. high pollution concentration, large precipitation, fog) is calculated for each class, and the classes are then sorted by increasing p_i.

Then

http://perswww.kuleuven.be/~u0044657/COST733/TableGini.PNG

where n_i is the total number of days in class i (after sorting), m_i is the number of days meeting the criterion in class i, N is the total number of days in all classes, M is the total number of days meeting the criterion in all classes, and L is the number of classes.
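The procedure can be sketched with the standard trapezoidal evaluation of the area under the Lorenz curve, G = 1 - Σ (x_k - x_{k-1})(y_k + y_{k-1}), where x_k and y_k are the cumulative shares of all days (n_i/N) and of event days (m_i/M) after sorting. This is the generic textbook form and is assumed here; it may be written differently from the expression shown above. The function name `gini_coefficient` is hypothetical.

```python
def gini_coefficient(n_days, m_days):
    """Gini coefficient of event days across the classes of a classification.

    n_days : list with the total number of days in each class (n_i)
    m_days : list with the number of event days in each class (m_i)
    """
    # Sort classes by increasing event probability p_i = m_i / n_i.
    classes = sorted(zip(n_days, m_days), key=lambda nm: nm[1] / nm[0])
    total_n = sum(n_days)   # N
    total_m = sum(m_days)   # M

    x_prev = y_prev = 0.0
    area = 0.0              # twice the area under the Lorenz curve
    for n_i, m_i in classes:
        x = x_prev + n_i / total_n
        y = y_prev + m_i / total_m
        area += (x - x_prev) * (y + y_prev)
        x_prev, y_prev = x, y
    return 1.0 - area
```

G is 0 when all classes share the same event probability and grows as the events become concentrated in fewer classes, so a higher G indicates a classification that discriminates the chosen characteristic better.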

### References

**Diversity indices**

*Hubert, L. and P. Arabie, 1985*: [http://geo21.geo.uni-augsburg.de/cost733_WG3/Literature/hubert_arabie.pdf Comparing Partitions]. Journal of Classification, 2, 193-218. (Adjusted Rand index)

*Kuncheva, L.I. and S. T. Hadjitodorov, 2004*: [http://geo21.geo.uni-augsburg.de/cost733_WG3/Literature/kuncheva_diversity.pdf Using diversity in cluster ensembles]. 2004 IEEE International Conference on Systems, Man and Cybernetics, Vol. 2, 1214-1219. (Brief discussion of diversity indices)

*Rand, W. M., 1971*: Objective criteria for the evaluation of clustering methods. J. Amer. Stat. Assoc., 66, 846–850. (Rand index)

*Southwood, T. R. E., 1978*: Ecological Methods, 2nd edn. London: Chapman & Hall. (Jaccard index)

*Strehl, A. and J. Ghosh, 2002*: [http://geo21.geo.uni-augsburg.de/cost733_WG3/Literature/Strehl2002.pdf Cluster ensembles – A knowledge reuse framework for combining partitions]. Journal of Machine Learning Research, 3, 583-617. (Mutual information)

**Evaluation criteria**

*Calinski, T., and J. Harabasz, 1974*: [http://geo21.geo.uni-augsburg.de/cost733_WG3/Literature/Calinski1974.pdf A dendrite method for cluster analysis]. Commun. Stat., 3, 1–27. (Pseudo-F)

*Huth, R., 1996*: An intercomparison of computer-assisted circulation classification methods. Int. J. Climatol. 16, 893-922. (Pattern correlation ratio)

*Kalkstein, L. S., G. Tan, and J. A. Skindlov, 1987*: [http://geo21.geo.uni-augsburg.de/cost733_WG3/Literature/Kalkstein1987.pdf An evaluation of three clustering procedures for use in synoptic climatological classification]. J. Appl. Meteor., 26, 717–730. (Within-type standard deviation)

*Milligan, G., and M. Cooper, 1985*: [http://geo21.geo.uni-augsburg.de/cost733_WG3/Literature/Milligan1985.pdf An examination of procedures for determining the number of clusters in a data set]. Psychometrika, 50, 159–179. (Comparison of evaluation criteria)

*Rousseeuw, P., 1987*: Silhouettes: a graphical aid to the interpretation and validation of cluster analysis. Journal of Computational and Applied Mathematics, 20, 53-65. (Silhouette index)

**Occurrence/Frequency criteria**

*Duckstein, L., Bardossy, A. and Bogardi, I., 1993*: Linkage between the occurrence of daily atmospheric circulation patterns and floods: an Arizona case study. Journal of Hydrology, 143(3-4): 413-428.

*Lorenz, M. O., 1905*: Methods of measuring the concentration of wealth. Publications of the American Statistical Association 9, 209–219.

*Gini, C., 1921*: Measurement of Inequality of Incomes. The Economic Journal 31, 124–126.