1 Introduction
The arithmetic midrange of a finite collection of real numbers $x_1, \dots, x_N$ is defined as the mean of the extremal values: $\frac{1}{2}\left(\min_i x_i + \max_i x_i\right)$. This number can also be uniquely characterized as the solution to the optimization problem
(1) $\min_{x \in \mathbb{R}} \max_{1 \le i \le N} |x - x_i|.$
One can also characterize the midrange as the limit of an inductive procedure on the input data points in the following way. The merits of such a characterization will become clear in due course.
For any $x, y \in \mathbb{R}$, define the curve $\gamma_{x,y} : [0,1] \to \mathbb{R}$ by $\gamma_{x,y}(t) = (1-t)x + ty$. For any initialization $x_0 \in \mathbb{R}$, we may generate a sequence $(x_n)$ as follows. Given a point $x_n$:

choose a point $x^* \in \{x_1, \dots, x_N\}$ such that $|x_n - x^*| \ge |x_n - x_i|$ for all $i$;

set $x_{n+1} = \gamma_{x_n, x^*}\left(\frac{1}{n+2}\right)$.
Proposition 1. Any sequence $(x_n)$ generated as above converges to the midrange of $\{x_1, \dots, x_N\}$.
Proof. Denote by $x_{\min}$ and $x_{\max}$ the minimum and maximum values within $\{x_1, \dots, x_N\}$ and write $m = \frac{1}{2}(x_{\min} + x_{\max})$ and $R = \frac{1}{2}(x_{\max} - x_{\min})$. For any sequence generated as above, there exists $n_0$ such that $x_{n_0} \ge x_{\min}$ and $x_{n_0} \le x_{\max}$. Such an $n_0$ can easily be found by taking a sufficiently large $n$, and since each subsequent iterate is a convex combination of its predecessor and a data point, $x_n \in [x_{\min}, x_{\max}]$ for all $n \ge n_0$. Suppose that $x_n \ge m$. Since all data lie in $[x_{\min}, x_{\max}]$, the selected farthest point satisfies $x^* = x_{\min}$, except possibly in the case of a tie at $x_n = m$, where $x^* = x_{\max}$ may occur; in either case $|x^* - m| = R$ and $(x_n - m)(x^* - m) \le 0$. We then have
$x_{n+1} - m = \frac{n+1}{n+2}(x_n - m) + \frac{1}{n+2}(x^* - m),$
in which the two terms have opposite signs, so that
$|x_{n+1} - m| \le \max\left\{\frac{n+1}{n+2}|x_n - m|,\ \frac{R}{n+2}\right\}.$
By symmetry in the reasoning, the same bound holds when $x_n \le m$. By induction on $n$, we obtain
(2) $|x_n - m| \le \frac{(n_0 + 1)R}{n+1}$
for all $n \ge n_0$. Thus, we see that $(x_n)$ converges to the midrange $m$ at a rate of $O(1/n)$. ∎
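The scalar procedure above can be sketched in a few lines; this is an illustrative sketch that assumes a step weight of $1/(n+2)$ at iteration $n$, as in the proof:

```python
def inductive_midrange_scalar(data, x0, iters=2000):
    # Repeatedly step toward the farthest data point with weight 1/(n+2).
    x = x0
    for n in range(iters):
        x_star = max(data, key=lambda xi: abs(x - xi))  # farthest data point
        x = (1 - 1 / (n + 2)) * x + (1 / (n + 2)) * x_star
    return x
```

Regardless of the initialization, the iterates settle at the midrange of the extremes, consistent with the proposition.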
Although the midrange is clearly sensitive to outliers, it can be a useful measure in some circumstances. In particular, if the data points are uniformly spread within a domain, the mean coincides with the midrange and can thus be computed using only extremal values without sifting through all the points. Similarly, the midrange can be useful for analyzing data that is devoid of outliers, and it finds applications in clustering algorithms that rely on the isolation of outlying clusters (Steinley (2006); Carroll and Chaturvedi (1998); Stigler (2016)).

In this paper, we consider midrange statistics in the cone of symmetric positive definite (SPD) matrices. Data representations based on SPD matrices arise in an enormous range of applications as covariance matrices, including brain-computer interface (BCI) systems (Rao (2013); Zanini et al. (2018)), radar data processing (Arnaudon et al. (2013)), and diffusion tensor imaging (DTI) (Dryden et al. (2009)). Let $\mathbb{S}^n_{++}$ denote the space of $n \times n$ real SPD matrices. The conic geometry of $\mathbb{S}^n_{++}$ often renders conventional Euclidean approaches to statistics and analysis on it ineffective. Indeed, it is well-documented that using Euclidean algorithms on nonlinear spaces such as $\mathbb{S}^n_{++}$ often results in poor accuracy and undesirable effects, such as swelling phenomena in DTI (Arsigny et al. (2006)).

Several non-Euclidean geometries have been associated with $\mathbb{S}^n_{++}$ and successfully exploited in various applications. Geometries that have been studied in great detail include the log-Euclidean geometry and the affine-invariant Riemannian geometry. The log-Euclidean geometry is derived from the Lie group structure of $\mathbb{S}^n_{++}$ under the group operation $A \odot B = \exp(\log A + \log B)$ for $A, B \in \mathbb{S}^n_{++}$, where $\exp$ and $\log$ denote the usual matrix exponential and logarithm. The affine-invariant Riemannian geometry, on the other hand, is induced by the Riemannian metric
(3) $\langle X, Y \rangle_\Sigma = \operatorname{tr}\left(\Sigma^{-1} X \Sigma^{-1} Y\right),$
where $X$ and $Y$ are vectors belonging to the tangent space (the space of symmetric matrices) at $\Sigma \in \mathbb{S}^n_{++}$. This smooth metric structure induces a well-defined distance function given by
$d_R(A, B) = \left(\sum_{i=1}^{n} \log^2 \lambda_i(A^{-1}B)\right)^{1/2},$
where $\lambda_i(A^{-1}B)$ denote the eigenvalues of $A^{-1}B$. Moreover, the unique length-minimizing geodesic from $A$ to $B$ takes the form of the curve $t \mapsto A \#_t B$, where
(4) $A \#_t B = A^{1/2}\left(A^{-1/2} B A^{-1/2}\right)^t A^{1/2}$
for $t \in [0, 1]$ (Bhatia (2003)). The midpoint $A \# B = A \#_{1/2} B$ of this geodesic defines the geometric mean of $A$ and $B$. The geometric mean of $N$ SPD matrices $X_1, \dots, X_N$ can be defined as the unique solution to
(5) $\min_{X \in \mathbb{S}^n_{++}} \sum_{i=1}^{N} d_R^2(X, X_i),$
which is also known as the Karcher mean (Moakher (2005)).
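As a minimal numerical illustration of (4), the geodesic can be evaluated through eigendecompositions; the helper name `spd_power` is our own, and this is a sketch rather than a reference implementation:

```python
import numpy as np

def spd_power(A, t):
    # Fractional power of an SPD matrix via its eigendecomposition.
    w, V = np.linalg.eigh(A)
    return (V * w ** t) @ V.T

def geodesic_riemannian(A, B, t):
    # The length-minimizing geodesic (4): A^{1/2} (A^{-1/2} B A^{-1/2})^t A^{1/2}.
    Ah, Aih = spd_power(A, 0.5), spd_power(A, -0.5)
    return Ah @ spd_power(Aih @ B @ Aih, t) @ Ah
```

For commuting matrices the midpoint reduces to the entrywise geometric mean of the eigenvalues, e.g. the midpoint of diag(1, 4) and diag(9, 1) is diag(3, 2).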
An important property of the Riemannian geometry defined by (3) is affine-invariance, whereby congruence transformations form isometries of $\mathbb{S}^n_{++}$. In particular, for any matrix $M$ in the general linear group $GL(n)$, we have $d_R(M A M^T, M B M^T) = d_R(A, B)$ for all $A, B \in \mathbb{S}^n_{++}$. Affine-invariance of algorithms at the level of SPD matrices corresponds to invariance under affine transformations of the underlying feature vectors, which is often a desirable property in applications involving covariance matrices.
A non-Riemannian affine-invariant geometry associated with $\mathbb{S}^n_{++}$ is that induced by the Thompson metric on the cone of positive semidefinite matrices (Thompson (1963); Lemmens and Nussbaum (2012)). The Thompson metric on $\mathbb{S}^n_{++}$ takes the form
(6) $d_T(A, B) = \max\left\{\log \lambda_{\max}(A B^{-1}),\ \log \lambda_{\max}(B A^{-1})\right\},$
where $\lambda_{\max}(\cdot)$ denotes the maximum eigenvalue of its argument. Note that the Thompson metric can be rewritten as $d_T(A, B) = \max_i |\log \lambda_i(A^{-1}B)|$, which justifies the notation $d_\infty$. The pair $(\mathbb{S}^n_{++}, d_T)$ constitutes a complete metric space of nonpositive curvature (Bhatia (2003)). While the Riemannian geodesic (4) is also a length-minimizing geodesic of the Thompson metric, it is known that $d_T$ generally admits infinitely many geodesics between a pair of points (Nussbaum (1994)). In particular, the curve $\gamma_{A,B}$ defined below is a geodesic of $d_T$ from $A$ to $B$ which satisfies a number of attractive properties. Let $\beta$ and $\alpha$ denote the maximum and minimum eigenvalues of $A^{-1}B$. If $\beta \neq \alpha$, we define
(7) $\gamma_{A,B}(t) = \left(\frac{\beta^t - \alpha^t}{\beta - \alpha}\right) B + \left(\frac{\beta \alpha^t - \alpha \beta^t}{\beta - \alpha}\right) A,$
and $\gamma_{A,B}(t) = \alpha^t A$ otherwise. One of the attractive properties of (7) is that its midpoint scales geometrically,
$\gamma_{A, cB}\left(\tfrac{1}{2}\right) = \sqrt{c}\,\gamma_{A,B}\left(\tfrac{1}{2}\right) \quad \text{for all } c > 0,$
and coincides with the geometric mean $A \# B$ when $AB = BA$.
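These properties of the geodesic can be checked numerically; the following sketch implements the curve from its extremal generalized eigenvalues (the function name `gamma` is our own):

```python
import numpy as np
from scipy.linalg import eigvalsh

def gamma(A, B, t):
    # Thompson geodesic (7): only the extremal eigenvalues of A^{-1}B enter.
    lams = eigvalsh(B, A)                 # generalized eigenvalues of (B, A)
    alpha, beta = lams[0], lams[-1]
    if np.isclose(alpha, beta):
        return alpha ** t * A
    c1 = (beta ** t - alpha ** t) / (beta - alpha)
    c2 = (beta * alpha ** t - alpha * beta ** t) / (beta - alpha)
    return c1 * B + c2 * A
```

For commuting A = diag(1, 4) and B = diag(9, 1), the midpoint equals the geometric mean diag(3, 2), and replacing B by cB scales the midpoint by the square root of c.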
Computationally, the Thompson geodesic (7) is considerably less expensive to construct than the Riemannian geodesic (4), particularly for high-dimensional matrices. This is a consequence of the fact that (7) relies only on the computation of extremal generalized eigenvalues of the pair $(A, B)$, which can be computed efficiently by several algorithms (Golub and van der Vorst (2000); Stewart (2002); Mishra and Sepulchre (2016)). Similarly, the computation of the Thompson distance between a pair of SPD matrices relies only on the evaluation of extremal generalized eigenvalues, whereas the Riemannian distance involves the full generalized eigenspectrum of the pair of points.
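The computational point can be made concrete: both distances are functions of the generalized eigenvalues of the pair, but the Thompson distance needs only the extremal ones (a sketch, not the authors' code; here the full spectrum is computed for simplicity in both cases):

```python
import numpy as np
from scipy.linalg import eigvalsh

def riemannian_distance(A, B):
    # d_R(A, B): requires the full generalized eigenspectrum of (B, A).
    lams = eigvalsh(B, A)
    return np.sqrt(np.sum(np.log(lams) ** 2))

def thompson_distance(A, B):
    # d_T(A, B): requires only the extremal generalized eigenvalues, which
    # iterative solvers can obtain without computing the full spectrum.
    lams = eigvalsh(B, A)
    return max(np.log(lams[-1]), -np.log(lams[0]))
```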
In the case of positive scalars $x_1, \dots, x_N > 0$, affine-invariance simply reduces to invariance under scaling in the cone of positive real numbers. The affine-invariant midrange of a collection of positive scalars is then the geometric mean $\sqrt{x_{\min} x_{\max}}$ of the extremal values, which can be formulated as the unique solution of the optimization problem
(8) $\min_{x > 0} \max_{1 \le i \le N} |\log x - \log x_i|.$
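The scalar characterization (8) can be sanity-checked by brute force; the data values and grid bounds below are arbitrary illustrations:

```python
import numpy as np

# Brute-force the objective in (8) over a log-spaced grid. The minimizer
# is the geometric mean of the extremal values, here sqrt(1 * 9) = 3.
data = np.array([1.0, 2.0, 9.0])
grid = np.geomspace(0.5, 20.0, 200001)
objective = np.max(np.abs(np.log(grid)[:, None] - np.log(data)[None, :]), axis=1)
x_star = grid[np.argmin(objective)]
```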
The affine-invariant SPD matrix analogue of the above optimization formulation of the geometric midrange takes the form
(9) $\min_{X \in \mathbb{S}^n_{++}} \max_{1 \le i \le N} d_T(X, X_i),$
which can be recast as the convex optimization problem
(10) $\min_{X \in \mathbb{S}^n_{++},\, t > 0} t \quad \text{subject to} \quad t^{-1} X_i \preceq X \preceq t X_i, \quad i = 1, \dots, N,$
where $\preceq$ denotes the matrix Löwner order (Mostajeran et al. (2019a, b)).
Although (10) can be solved using standard convex optimization packages, the problem does not scale well with the dimension $n$ or the number of data points $N$. Thus, the aim of this work is to develop an alternative route to a notion of a geometric midrange of SPD matrices with computational properties that remain favorable as the dimension $n$ grows. It is in this context that we investigate the natural generalization of the inductive procedure for computing the midrange of scalars outlined in Proposition 1 to the matrix setting, as an alternative definition of the geometric midrange of a collection of points in $\mathbb{S}^n_{++}$. By using the Thompson metric to compute the distances in the first condition of Proposition 1, and the Thompson geodesics $\gamma_{A,B}$ as the interpolating curves in the second condition, we arrive at an algorithm that relies only on the computation of extremal generalized eigenvalues. This algorithm and its computational properties are investigated in the following sections.
2 Inductive Midrange
The inductive midrange (IMR) algorithm takes as input a set of symmetric positive definite (SPD) matrix data $\{X_1, \dots, X_N\} \subset \mathbb{S}^n_{++}$, as well as an initialization point $Y_0$, which may or may not be a member of the data set, and iterates towards a representative midrange centroid of the data set. Given the initial point $Y_0$, a sequence $(Y_n)$ is generated according to the following process.

Choose a point $X^* \in \{X_1, \dots, X_N\}$ such that $d_T(Y_n, X^*) \ge d_T(Y_n, X_i)$ for all $i$.

Set $Y_{n+1} = M_{w_n}(Y_n, X^*)$ with $w_n = \frac{1}{n+2}$.
Here $M_w$ denotes the weighted geometric midrange
(11) $M_w(A, B) = \gamma_{A,B}(w),$
with $\gamma_{A,B}$ the Thompson geodesic (7) and $\alpha$, $\beta$ the extremal eigenvalues of $A^{-1}B$ as before. As shown in Lim (2013), the point $M_{1/2}(A, B)$ is a Thompson metric midpoint of $A$ and $B$. That is,
$d_T\left(A, M_{1/2}(A, B)\right) = d_T\left(M_{1/2}(A, B), B\right) = \tfrac{1}{2}\, d_T(A, B).$
The effect of the weighted midrange in the IMR algorithm is to produce step sizes that decrease as $O(1/n)$. As outlined above, these steps are restricted to the direction of the furthest data point from each successive IMR candidate, reflecting the original intent of midrange statistics to primarily account for data outliers.
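Concretely, the iteration just described can be sketched in NumPy as follows; this is an illustrative implementation under our assumptions (notably the step weight $1/(n+2)$), not the authors' code:

```python
import numpy as np
from scipy.linalg import eigvalsh

def thompson_distance(A, B):
    # d_T(A, B) from the extremal generalized eigenvalues of (B, A).
    lams = eigvalsh(B, A)
    return max(np.log(lams[-1]), -np.log(lams[0]))

def thompson_geodesic(A, B, t):
    # Geodesic (7): built from the extremal generalized eigenvalues of (B, A).
    lams = eigvalsh(B, A)
    alpha, beta = lams[0], lams[-1]
    if np.isclose(alpha, beta):
        return alpha ** t * A
    c1 = (beta ** t - alpha ** t) / (beta - alpha)
    c2 = (beta * alpha ** t - alpha * beta ** t) / (beta - alpha)
    return c1 * B + c2 * A

def inductive_midrange(data, Y0, iters=500):
    # IMR: step toward the farthest data point with weight 1/(n+2).
    Y = Y0
    for n in range(iters):
        X_star = max(data, key=lambda X: thompson_distance(Y, X))
        Y = thompson_geodesic(Y, X_star, 1.0 / (n + 2))
    return Y
```

For 1x1 matrices the iteration reduces to the scalar geometric midrange, so data {1, 4} converges to the geometric mean 2 of the extremes.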
We emphasize that the convergence point of the IMR is not equivalent to the optimization midrange discussed in Section 1. As a demonstrative example, we evaluated both midranges on a small data set of SPD matrices. While the optimization midrange attains the minimum cost-function value of 0.790, the IMR has a nearby cost-function evaluation of 0.811, which represents an increase of less than 3%. The Thompson distance between these two midrange matrices is 0.33, which is modest in comparison to the average separation between data set matrices. Thus, we observe an example of the interesting phenomenon that equivalent characterizations of a mathematical object in a linear space can generalize to distinct notions in nonlinear spaces.
The primary observed features of the IMR algorithm are a universal convergence rate, initialization-invariance, and dependence exclusively on a subset of the input data called active data. We will demonstrate each of these features in turn through numerical studies on a broad range of data set sizes $N$ and matrix dimensions $n$.
2.1 Numerical Results
The IMR algorithm converges at a rate of $O(1/n)$ in the number of iterations $n$, regardless of the size or matrix dimensionality of the data set.
We observed that the IMR algorithm converges in all of our numerical experiments. The convergence rate of the IMR algorithm is assessed by measuring the Thompson distance between the IMR estimate at each step and the final IMR value after a large number of steps. Although we do not present an analytic expression for the IMR convergence point in this paper, we take this final value as an acceptable estimate of the true convergence point for the purposes of establishing a convergence rate.

Fig. 1 (a) shows a typical example of the convergence measure throughout an IMR run with $N = 5$ data points. In the final stretch of iterations, the error inherent in our estimate of the true convergence point leads to nonlinearity in this measure. This final range of iterations is therefore excluded from fitting, and an averaged trend of convergence is demonstrated. The three colors in the plot correspond to three of the input data which are taken to initialize the IMR algorithm; the remaining two points converge trivially. Fig. 1 (b) shows a typical example of the same kind for a larger data set, with only one initialization included.
Table 1 aggregates observed convergence rates for setups with larger input data sets or greater matrix dimensionality. Average convergence rates for 10 runs are given for each configuration $(N, n)$, with the fit excluding the nonlinearity in the final iterations. Random SPD data are generated via the transpose-product method, as are all other SPD data in this paper unless otherwise specified.
(N, n)   (5, 5)   (5, 20)   (50, 5)   (50, 20)
Rate     0.9942   0.9932    0.9965    1.0019
For a given data set, the IMR algorithm converges to the same SPD matrix regardless of initialization.
The input data and IMR trajectories of $2 \times 2$ SPD matrices can be visualized in a cone in $\mathbb{R}^3$ as described by Mostajeran and Sepulchre (2018), according to the bijection
(12) $\begin{pmatrix} a & b \\ b & c \end{pmatrix} \mapsto \left(b,\ \frac{a-c}{2},\ \frac{a+c}{2}\right),$
under which $\mathbb{S}^2_{++}$ corresponds to the open circular cone $\{(x, y, z) : z > \sqrt{x^2 + y^2}\}$.
Fig. 2 shows a successful IMR run for a small set of $2 \times 2$ SPD matrices depicted in a 2D projection of the cone in $\mathbb{R}^3$. Although the figure depicts a 2D projection for simplicity, the convergence is observed in the full three-dimensional space. The top and right-hand input data points move towards each other to their mutual midpoint during the first IMR step, and therefore trivially converge to a single result; however, it is less trivial that the bottom input data point converges to the same result.
In Fig. 1 (c) the pairwise Thompson distances between an arbitrary reference initialization and all others are plotted throughout an IMR run. Since different initializations lead to IMR paths which are drawn to some data points more often than others, quasi-periodic features dominate the pairwise distances. Despite this, all pairwise distances appear to converge to zero at the same $O(1/n)$ rate.
To further demonstrate invariance beyond initializations drawn from the input data set, IMR runs were performed with randomly generated SPD matrices as initializations for several data set configurations. In all cases, a maximum Thompson distance separation could be identified within which all IMR results could be bounded for the chosen data set and initializations. By increasing the number of IMR steps, the maximum separation could be reduced arbitrarily. The average separation between the trajectories after a fixed number of steps was significantly smaller than the maximum separation in all cases, as shown in Table 2.
(N, n)                    (2, 5)   (5, 5)   (20, 5)   (100, 5)
Max separation            .0018    .0310    .0864     .1453
Average separation        .0008    .0072    .0188     .0277
Convergence count (/100)  100      100      100       100
The IMR convergence point depends exclusively on a subset of the initial data called the active data.
This property is essential for the IMR to be interpreted as a midrange-type centroid for a data set. For a data set $\{X_1, \dots, X_N\}$, we define the "external" data as the subset of all data points $X_j$ such that, for some choice of initialization and some iteration $n$,
(13) $d_T(Y_n, X_j) \ge d_T(Y_n, X_i) \quad \text{for all } i.$
Only the external data are available for the IMR algorithm to target and step towards on each iteration. As the size of an SPD data set increases, it becomes more difficult to arrange the data in such a way that the number of external points increases.
However, simulations demonstrate that the IMR need not be dependent on all data in the external set in the limit of infinite time, and in fact is generally dependent on a subset which may be significantly smaller; we term this subset the active data.
The properties of active and external data are observed even in very small data sets. If the right-hand point in Fig. 2 is moved leftward, a sharp transition occurs beyond which the IMR is dependent only on the remaining two data points, and can be shown to occur at their geometric midrange. Fig. 3 shows a more typical example, in which any or all of three input data points internal to the data set (blue) may be removed without affecting the IMR convergence point, and four external but non-active data points (black) may also be removed without affecting the IMR convergence point. In this example, there are only three active (red) data points.
The IMR algorithm may therefore be made more efficient by advance identification of all active points, restricting the maximum-distance search to this subset alone. Data sets with extreme outliers will tend to have fewer active points, and the IMR would accordingly experience a dramatic speedup in such cases.
3 Midrange Clustering on Matrix Data
Clustering methodologies using nonlinear geometries related to eigenspectra of SPD matrices have shown promise. For example, see the adaptation of K-means clustering to several similarity measures considered by Stanitsas et al. (2017), which include the affine-invariant Riemannian (AIRM) and log-Euclidean metrics. See also Nielsen and Sun (2017) for an account of clustering in Hilbert simplex geometry. In this work, we consider K-means clustering with the Thompson distance as the similarity measure, and employ the IMR to calculate centroids for clusters. We further employ X-means and K-means++ algorithms with geometric midrange statistics and evaluate their performance.
The K-means clustering algorithm may be summarized as follows, where the inputs are a set of SPD data $\{X_1, \dots, X_N\}$ and a desired quantity of clusters $K$ (for a more thorough review, see Steinley (2006)):

Assign an initial cluster label $j \in \{1, \dots, K\}$ to each $X_i$. The set of points with label $j$ defines a cluster $C_j$.

For each cluster $C_j$, a centroid $\mu_j$ is calculated (here via IMR).

For each data point $X_i$, the nearest centroid $\mu_j$ is identified, and the cluster label of $X_i$ is reassigned to $j$.

Repeat steps 2 and 3 until the cluster labels do not change between iterations.
Two straightforward initialization methods are random assignment of labels, and random selection of data points as initial centroids, where labels are then assigned as in step 3. Both of these methods can have severe limitations due to false merging of clusters, where the severity decreases with the connectedness of the data set. In a highly disconnected data set, a true cluster which does not initially contain at least one centroid is unlikely to be accurately identified during the course of the algorithm. Oversplitting of clusters is a less severe but still significant source of error. Nearperfect accuracy can be achieved given advance knowledge of approximate cluster locations, so that one initial centroid can be placed in each; however this is not the usual case in applications, where the optimal value of is often not even known.
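The K-means loop above, with Thompson-distance assignments and IMR centroids, can be sketched as follows. This is a self-contained illustration under our assumptions: the helper names are ours, initialization uses randomly selected data points as centroids (one of the two methods just discussed), and a fixed number of IMR steps stands in for full centroid convergence:

```python
import numpy as np
from scipy.linalg import eigvalsh

def d_T(A, B):
    # Thompson distance from the extremal generalized eigenvalues of (B, A).
    lams = eigvalsh(B, A)
    return max(np.log(lams[-1]), -np.log(lams[0]))

def geodesic(A, B, t):
    # Thompson geodesic between SPD matrices (cf. (7)).
    lams = eigvalsh(B, A)
    a, b = lams[0], lams[-1]
    if np.isclose(a, b):
        return a ** t * A
    return ((b ** t - a ** t) * B + (b * a ** t - a * b ** t) * A) / (b - a)

def imr_centroid(cluster, iters=100):
    # A fixed number of inductive-midrange steps as a centroid stand-in.
    Y = cluster[0]
    for n in range(iters):
        X_star = max(cluster, key=lambda X: d_T(Y, X))
        Y = geodesic(Y, X_star, 1.0 / (n + 2))
    return Y

def kmeans_thompson(data, K, max_iter=50, seed=0):
    rng = np.random.default_rng(seed)
    # Initialize with K distinct data points as centroids.
    cents = [data[i] for i in rng.choice(len(data), size=K, replace=False)]
    labels = np.array([min(range(K), key=lambda j: d_T(cents[j], X))
                       for X in data])
    for _ in range(max_iter):
        # Step 2: recompute the centroid of each non-empty cluster via IMR;
        # an emptied cluster keeps its previous centroid.
        for j in set(labels):
            cents[j] = imr_centroid([X for X, l in zip(data, labels) if l == j])
        # Step 3: reassign every point to its nearest centroid.
        new = np.array([min(range(K), key=lambda j: d_T(cents[j], X))
                        for X in data])
        if np.array_equal(new, labels):  # Step 4: labels stable, stop.
            break
        labels = new
    return labels
```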
3.1 X-means Clustering
X-means is an iterative extension of K-means, first proposed by Pelleg and Moore (2000), which resolves the issue of erroneously merged clusters by repeated binary splits. In that work, the authors make use of the Bayesian information criterion (BIC) to accept or reject these binary splits, as described in the following outline:

Set $K$ to an initial lower-bound value.

Perform K-means clustering on the data set.

For each cluster $C_j$, generate two initial split centroids and perform K-means with $K = 2$ on $C_j$ alone.

If the BIC score for a split cluster exceeds that of the unsplit cluster, accept that split.

Repeat steps 2 through 4 until no binary splits are accepted.
In our X-means implementation, a low (empirical) limit is placed on the number of attempted splits that may be performed on a particular cluster. BIC scoring is modified to include a Thompson-based maximum likelihood estimator. We also split centroids following Thompson geometry, by randomly selecting one matrix on a fixed-radius sphere around the unsplit centroid, and taking this SPD matrix and its antipodal point on the sphere as the initial split centroids before performing the intermediary clustering. Sphere sampling is performed by generating SPD matrices in a near-radially symmetric distribution around the identity. As the geodesic given by (7) is parameterized proportionally to Thompson distance, a retraction (or extension) along the geodesic is then made such that the generated point has the desired radius from the identity. By affine invariance of the metric, any point generated on the sphere may then be transported to the desired centroid by a congruence transformation.
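The sphere-sampling step just described can be sketched as follows; this is our own hedged reconstruction of the procedure, with helper names that are not from the paper:

```python
import numpy as np
from scipy.linalg import eigvalsh

def sym_fun(M, f):
    # Apply a scalar function to a symmetric matrix via eigendecomposition.
    w, V = np.linalg.eigh(M)
    return (V * f(w)) @ V.T

def d_T(A, B):
    lams = eigvalsh(B, A)
    return max(np.log(lams[-1]), -np.log(lams[0]))

def geodesic(A, B, t):
    # Thompson geodesic (cf. (7)); also valid for t > 1 as an extension.
    lams = eigvalsh(B, A)
    a, b = lams[0], lams[-1]
    if np.isclose(a, b):
        return a ** t * A
    return ((b ** t - a ** t) * B + (b * a ** t - a * b ** t) * A) / (b - a)

def sample_thompson_sphere(centroid, radius, rng):
    n = centroid.shape[0]
    # Near-radially-symmetric SPD perturbation of the identity.
    S = rng.standard_normal((n, n))
    P = sym_fun(0.1 * (S + S.T) / 2.0, np.exp)
    # Retract (or extend) along the geodesic from I to P: since the geodesic
    # is parameterized proportionally to Thompson distance, this lands the
    # point exactly on the sphere of the desired radius.
    P = geodesic(np.eye(n), P, radius / d_T(np.eye(n), P))
    # Transport to the centroid via the congruence X -> C^(1/2) X C^(1/2),
    # which preserves Thompson distances by affine invariance.
    Ch = sym_fun(centroid, np.sqrt)
    return Ch @ P @ Ch
```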
Fig. 4 (a) shows representative X-means results for 200 SPD data points in 10 disjoint clusters, visualized in the cone projection. These disjoint clusters are created by generating 10 cluster centers with the requirement that all pairwise Thompson distances are at least 1. For each cluster center, 20 data points are uniformly generated on a sphere of radius 0.2 around the cluster center. Although oversplitting can still occur, as exemplified by the center-left cluster in the plot, X-means massively reduces the risk of falsely merged clusters. Only one falsely merged cluster appears: the centermost cluster in this figure, where the bounding boxes of the two true clusters overlap.
For higher-dimensional SPD matrices, clustering results are best visualized in tabular form; for a selection of matrix dimensionalities of up to 20, X-means accuracy with constructed data sets is summarized in Table 3, averaged over 20 runs. In Table 3, 'points identified' refers to the number of points that were assigned to the correct cluster. 'Clusters identified' refers to the number of clusters that were precisely identified, in the sense that every point in the original assignment was correctly identified and no other points were identified erroneously. 'Clusters lost' refers to the number of clusters that were not detected due to merging with nearby clusters. Due to volumetric scaling that naturally occurs with increased dimensionality, more accurate clustering is observed for high dimensions. Significant false merging of clusters in the $n = 2$ case appears to be attributable to poor performance of the BIC for small matrices, which gives a lower likelihood of splitting during step 4 of the algorithm.
n                          2      5      10     20
Points identified (/200)   109.2  172.0  170.0  200.0
Clusters identified (/20)  4.3    8.2    8.3    10.0
Clusters lost (/20)        4.5    1.4    1.5    0.0

3.2 K-means++ Clustering
A perhaps less invasive approach, rather than iterating the entire K-means procedure to reassign relatively few cluster labels, is to directly modify the initialization procedure, which is the source of many oversplitting and false-merging errors. Arthur and Vassilvitskii (2007) achieve this by constructing a new initialization procedure from the ground up, which relies on no prior knowledge of the clusters other than the desired quantity $K$. The resultant K-means++ initialization is as follows:

Select an initial centroid at random from the data points.

For each data point $X_i$, assign a weight equal to the squared distance from $X_i$ to the nearest centroid chosen so far.

Select another centroid among the data at random according to these (normalized) weights.

Repeat steps 2 and 3 until $K$ centroids have been chosen.
This procedure gives a vanishing likelihood that more than one initial centroid will be selected from the same cluster, and can also give such a strong initial guess at the clustering that K-means requires far fewer iterations to converge. In our implementation, the Thompson distance is employed for weight assignments.
Fig. 4 (b) shows a typical K-means++ result for the same 200 SPD data points as in the above X-means example. No oversplitting or false merging occurs. The centermost pair of clusters, which were falsely merged in the X-means run, are now distinguished into two clusters in a qualitatively sensible way despite the overlap.
Clustering results are again obtained for a selection of matrix dimensionalities up to 100 and averaged over 20 runs. The K-means++ performance with 200 data points and 10 clusters is summarized in Table 4.
n                          2      5      10     20     100
Points identified (/200)   186.2  190.5  188.5  193.2  193.9
Clusters identified (/10)  8.5    8.9    8.8    9.3    9.3
Clusters lost (/10)        0.5    0.3    0.5    0.3    0.3

4 Conclusion
We have introduced a novel inductive geometric midrange algorithm based on the Thompson geometry of the cone of positive definite matrices of a given dimension. The formulation is the natural generalization of the inductive characterization of the midrange of data points in linear spaces and has attractive computational properties. Important theoretical questions remain. For instance, how can one efficiently identify the subset of active data from a given data set before implementing the IMR algorithm on the full set? Such an identification would result in dramatic improvements in the efficiency of the algorithm for large data sets, since it would remove the need for many unnecessary distance computations.
References
 Arnaudon et al. (2013) Arnaudon, M., Barbaresco, F., and Yang, L. (2013). Riemannian medians and means with applications to radar signal processing. IEEE Journal of Selected Topics in Signal Processing, 7(4), 595–604. doi:10.1109/JSTSP.2013.2261798.
 Arsigny et al. (2006) Arsigny, V., Fillard, P., Pennec, X., and Ayache, N. (2006). Log-Euclidean metrics for fast and simple calculus on diffusion tensors. Magnetic Resonance in Medicine, 56(2), 411–421. doi:10.1002/mrm.20965.
 Arthur and Vassilvitskii (2007) Arthur, D. and Vassilvitskii, S. (2007). K-means++: The advantages of careful seeding. In Proceedings of the Eighteenth Annual ACM-SIAM Symposium on Discrete Algorithms, 1027–1035. Society for Industrial and Applied Mathematics, USA.
 Bhatia (2003) Bhatia, R. (2003). On the exponential metric increasing property. Linear Algebra and its Applications, 375, 211–220.

 Carroll and Chaturvedi (1998) Carroll, J.D. and Chaturvedi, A. (1998). K-midranges clustering. In A. Rizzi, M. Vichi, and H.H. Bock (eds.), Advances in Data Science and Classification, 3–14. Springer Berlin Heidelberg, Berlin, Heidelberg.
 Dryden et al. (2009) Dryden, I.L., Koloydenko, A., and Zhou, D. (2009). Non-Euclidean statistics for covariance matrices, with applications to diffusion tensor imaging. The Annals of Applied Statistics, 3(3), 1102–1123.
 Golub and van der Vorst (2000) Golub, G.H. and van der Vorst, H.A. (2000). Eigenvalue computation in the 20th century. Journal of Computational and Applied Mathematics, 123(1), 35–65. doi:10.1016/S0377-0427(00)00413-1. Numerical Analysis 2000. Vol. III: Linear Algebra.
 Lemmens and Nussbaum (2012) Lemmens, B. and Nussbaum, R. (2012). Nonlinear Perron-Frobenius Theory. Cambridge Tracts in Mathematics. Cambridge University Press. doi:10.1017/CBO9781139026079.
 Lim (2013) Lim, Y. (2013). Geometry of midpoint sets for Thompson’s metric. Linear Algebra and its Applications, 439(1), 211–227. doi:10.1016/j.laa.2013.03.012.
 Mishra and Sepulchre (2016) Mishra, B. and Sepulchre, R. (2016). Riemannian preconditioning. SIAM Journal on Optimization, 26(1), 635–660. doi:10.1137/140970860.
 Moakher (2005) Moakher, M. (2005). A differential geometric approach to the geometric mean of symmetric positivedefinite matrices. SIAM J. Matrix Anal. Appl., 26(3), 735–747. doi:10.1137/S0895479803436937.
 Mostajeran et al. (2019a) Mostajeran, C., Grussler, C., and Sepulchre, R. (2019a). Affineinvariant midrange statistics. In F. Nielsen and F. Barbaresco (eds.), Geometric Science of Information, 494–501. Springer International Publishing.
 Mostajeran et al. (2019b) Mostajeran, C., Grussler, C., and Sepulchre, R. (2019b). Geometric matrix midranges. arXiv preprint arXiv:1907.04188.
 Mostajeran and Sepulchre (2018) Mostajeran, C. and Sepulchre, R. (2018). Ordering positive definite matrices. Information Geometry, 1(2), 287–313. doi:10.1007/s41884-018-0003-7.
 Nielsen and Sun (2017) Nielsen, F. and Sun, K. (2017). Clustering in Hilbert simplex geometry. arXiv preprint arXiv:1704.00454.

 Nussbaum (1994) Nussbaum, R.D. (1994). Finsler structures for the part metric and Hilbert’s projective metric and applications to ordinary differential equations. Differential and Integral Equations, 7(5-6), 1649–1707.
 Pelleg and Moore (2000) Pelleg, D. and Moore, A. (2000). X-means: Extending k-means with efficient estimation of the number of clusters. In Proceedings of the 17th International Conf. on Machine Learning, 727–734. Morgan Kaufmann.
 Rao (2013) Rao, R.P.N. (2013). Brain-Computer Interfacing: An Introduction. Cambridge University Press. doi:10.1017/CBO9781139032803.

 Stanitsas et al. (2017) Stanitsas, P., Cherian, A., Morellas, V., and Papanikolopoulos, N. (2017). Clustering positive definite matrices by learning information divergences. In 2017 IEEE International Conference on Computer Vision Workshops (ICCVW), 1304–1312. doi:10.1109/ICCVW.2017.155.
 Steinley (2006) Steinley, D. (2006). K-means clustering: A half-century synthesis. The British Journal of Mathematical and Statistical Psychology, 59, 1–34. doi:10.1348/000711005X48266.
 Stewart (2002) Stewart, G.W. (2002). A Krylov–Schur algorithm for large eigenproblems. SIAM Journal on Matrix Analysis and Applications, 23(3), 601–614. doi:10.1137/S0895479800371529.
 Stigler (2016) Stigler, S.M. (2016). The seven pillars of statistical wisdom. Harvard University Press.
 Thompson (1963) Thompson, A.C. (1963). On certain contraction mappings in a partially ordered vector space. Proceedings of the American Mathematical Society, 14(3), 438–443.
 Zanini et al. (2018) Zanini, P., Congedo, M., Jutten, C., Said, S., and Berthoumieu, Y. (2018). Transfer learning: A Riemannian geometry framework with applications to brain-computer interfaces. IEEE Transactions on Biomedical Engineering, 65(5), 1107–1116.