Impact of Dataset Scaling on Hierarchical Clustering: A Comparative Analysis of Distance-Based and Ratio-Based Methods



INTRODUCTION
Agglomerative hierarchical clustering is a widely employed technique in data analysis and machine learning for grouping similar data points into clusters in a hierarchical manner. This method begins with each data point as a separate cluster and iteratively merges clusters based on a chosen linkage criterion until a single cluster encompassing all data points is formed [1].
To link objects together in a clustering algorithm, proximity measures are employed. The proximity measure (similarity/dissimilarity) is calculated using the features or parameters in the dataset. To obtain the dissimilarity between two cases i and j, each described by p parameters {(x_i1, x_i2, ..., x_ip) and (x_j1, x_j2, ..., x_jp)}, the dissimilarity measure can be taken as the difference between these p parameters in cases i and j. One commonly used dissimilarity measure is the Euclidean distance, which is the square root of the sum of squared differences between the p parameters. The Euclidean distance [2] between the two cases is calculated as:

d(i, j) = sqrt((x_i1 - x_j1)^2 + (x_i2 - x_j2)^2 + ... + (x_ip - x_jp)^2)    (1)

The Euclidean distance is a dissimilarity measure between single cases. In agglomerative hierarchical clustering, only in the first step is every case located in a cluster of its own [2].
The closest, or most similar, cases are merged at each step of agglomerative hierarchical clustering. The distance matrix for all pairs of observations is calculated, and the two cases with the smallest pairwise distance are merged. In the next step one of the clusters contains more than one case, and the dissimilarity between this cluster and the other clusters requires a linkage method [3]. Commonly used linkage methods are average linkage, single linkage and complete linkage [3]. Assume cases 1 and 2 are merged in the first step and now form one cluster; the distance between them and case 4 can be calculated by the average linkage d[(1,2), (4)], given by:

d[(1,2), (4)] = (d(1,4) + d(2,4)) / 2    (2)

where d is the Euclidean distance. In the next step, if the object formed by observations {1, 2, 4} has the smallest average linkage compared with the other objects, the average linkage with case 3 is calculated as:

d[(1,2,4), (3)] = (d(1,3) + d(2,3) + d(4,3)) / 3    (3)

If cluster 1 includes observations {1, 2} and cluster 2 includes {3, 4}, then the average linkage is the mean of the four between-cluster distances. Single linkage, on the other hand, takes the minimum distance between observations in cluster 1 and observations in cluster 2, while complete linkage takes the maximum such distance and uses it as the node level. In agglomerative hierarchical clustering, objects are merged step by step until all cases are located in one cluster.
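As a concrete illustration of these linkage rules, the sketch below (assuming SciPy is available; the four 2-D points are an invented toy example) builds the merge table for average, single and complete linkage from a Euclidean distance matrix:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import pdist

# Four 2-D points; point 3 lies far from the tight group {0, 1, 2}.
X = np.array([[0.0, 0.0],
              [0.0, 1.0],
              [1.0, 0.0],
              [5.0, 5.0]])

d = pdist(X)  # condensed Euclidean distance matrix over all pairs

# Each row of a linkage matrix records the two merged clusters,
# the node level at which they merge, and the new cluster's size.
Z_avg = linkage(d, method="average")
Z_sgl = linkage(d, method="single")
Z_cpl = linkage(d, method="complete")

print(Z_sgl[-1, 2], Z_cpl[-1, 2])
```

Single linkage joins the outlying point at a lower node level than complete linkage, because it uses the minimum rather than the maximum pairwise distance between clusters.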
To evaluate the structure of the dendrogram in a hierarchical clustering algorithm, all pairs of objects are compared in terms of their Euclidean distance and the ultrametric distance derived from their node level in the dendrogram. Kendall's tau [4] and the Goodman-Kruskal coefficient [5] are rank-based measures used as goodness-of-fit measures for hierarchical clustering, and both are useful for evaluating the clustering structure.
Since the Goodman-Kruskal coefficient (GK) considers only comparable pairs, it is more appropriate as a goodness-of-fit measure than Kendall's tau (KT), which considers all possible pairs, including ties and non-comparable pairs.
Thus, the primary objective of this study is to compare various hierarchical clustering methods while taking into account the influence of different dataset scaling techniques. Our observations indicate that the choice of scaling method significantly affects the structure of hierarchical clustering. The secondary goal of this study is to compare the outcomes of different hierarchical clustering methods with those obtained in a recent study conducted by Roux [6].
Additionally, we reevaluate the conclusions drawn in Roux's study in light of the findings from our own investigation. To ensure a fair comparison, the same datasets were employed as those utilized by Roux [6].

SCALING METHODS
The dataset can be normalized or standardized by several methods. In this study the real datasets were scaled using the following techniques:
i. Mean Absolute Deviation from the Median: This scaling method involves calculating the absolute difference between each data point and the median of the dataset, then finding the average of these absolute differences; it measures the average dispersion of data points around the median. Suppose z_i is the scaled data point, x_i is the original data point and x̂ is the median of the x_i; then the scaling can be computed as:

z_i = (x_i − x̂) / ((1/n) Σ_k |x_k − x̂|)
ii. Median Absolute Deviation: This scaling method involves finding the median of the absolute differences between each data point and the median of the dataset. It quantifies the spread of data points from the median while being robust to outliers. With the same notation:

z_i = (x_i − x̂) / median_k(|x_k − x̂|)
iii. Interquartile Range: The interquartile range (IQR) measures the spread of the data as the difference between the third quartile (Q3) and the first quartile (Q1) of the dataset. It describes the middle 50% of the data's distribution and is also robust to outliers. With the same notation:

z_i = (x_i − x̂) / (Q3 − Q1)

iv. Standard Deviation: The standard deviation is a widely used scaling method that measures the average deviation of data points from the mean of the dataset. It provides a comprehensive assessment of data dispersion, but it can be sensitive to outliers. Suppose z_i is the scaled data point, x_i is the original data point and x̄ is the mean of the x_i; then:

z_i = (x_i − x̄) / s,  where s = sqrt((1/(n − 1)) Σ_k (x_k − x̄)^2)

v. No Scaling: In this case, "no scaling" means that the data are used as is, without any specific scaling applied (z_i = x_i). This can be useful when the data are already on a compatible scale or when scaling is not considered necessary for the analysis.

These scalings were applied before running the various hierarchical clustering algorithms, in order to see the influence of each type of scaling on the structure of the clustering result.
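The five scalings above can be sketched in a few lines of NumPy. This is a hedged reconstruction: the centering convention (subtracting the median, or the mean for the standard deviation) is assumed, and in any case centering does not change Euclidean distances between cases, so only the divisor matters for the clustering:

```python
import numpy as np

def scale_meanad(x):
    """Divide by the mean absolute deviation from the median."""
    med = np.median(x)
    return (x - med) / np.mean(np.abs(x - med))

def scale_mad(x):
    """Divide by the median absolute deviation (robust to outliers)."""
    med = np.median(x)
    return (x - med) / np.median(np.abs(x - med))

def scale_iqr(x):
    """Divide by the interquartile range Q3 - Q1."""
    q1, q3 = np.percentile(x, [25, 75])
    return (x - np.median(x)) / (q3 - q1)

def scale_sd(x):
    """Classical standardization: center at the mean, divide by the SD."""
    return (x - np.mean(x)) / np.std(x, ddof=1)

x = np.array([1.0, 2.0, 2.0, 3.0, 14.0])  # toy column with one outlier
print(scale_mad(x))  # the outlier remains visible; the core is unchanged
```

In practice each column (variable) of the dataset is scaled separately before the distance matrix is computed.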

Real datasets:
Two real datasets were used in this study, the same as those used by Roux [6].

Pottery dataset:
The chemical dataset of Romano-British pottery [7] includes 48 cases. Three of these are unusable and were removed from the data, so the analysis data consist of 45 cases with 9 quantitative variables. The data were first standardized by the various types of scaling and then used in hierarchical clustering.

Fisher's Iris dataset:
This dataset is widely used in statistical analysis [8]. It includes 150 cases and four parameters: Sepal Length, Sepal Width, Petal Length and Petal Width. This dataset was also standardized with the various scaling methods and used in hierarchical clustering.
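For reference, this dataset is bundled with scikit-learn (an assumption of this sketch; any CSV copy of Fisher's data works equally well):

```python
from sklearn.datasets import load_iris

iris = load_iris()
X = iris.data  # 150 cases x 4 parameters
print(X.shape, iris.feature_names)
```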

Hierarchical clustering
In hierarchical clustering, the data points are initially treated as individual clusters and are then successively merged or divided based on their similarity or dissimilarity to form a tree-like structure known as a dendrogram. In this study, two forms of agglomerative hierarchical clustering were considered.

Hierarchical clustering with distance metrics:
Various linkage methods are employed for merging objects together in hierarchical clustering. In this study, average linkage, single linkage, complete linkage, the centroid method and the median method were used as criteria for merging observations.
i. Average linkage: Average linkage clustering, also known as UPGMA (Unweighted Pair Group Method with Arithmetic Mean), is a hierarchical agglomerative clustering method used in data analysis, primarily employed for grouping data points into clusters based on their similarity or dissimilarity. The average linkage between clusters C1 and C2 is computed as:

d(C1, C2) = (1 / (|C1| |C2|)) Σ_{i∈C1} Σ_{j∈C2} d(i, j)
ii. Single linkage: Single linkage clustering, also known as single-link or nearest-neighbour clustering, is a hierarchical agglomerative clustering method used in data analysis and data mining. Data points are initially treated as individual clusters, and at each step of the clustering process the two closest clusters are merged into a single cluster. The single linkage is computed as:

d(C1, C2) = min_{i∈C1, j∈C2} d(i, j)
iii. Complete linkage: Complete linkage clustering is a hierarchical agglomerative clustering method used in data analysis and machine learning. It is a bottom-up approach in which data points are initially treated as individual clusters and are successively merged into larger clusters based on their pairwise dissimilarity. The complete linkage is computed as:

d(C1, C2) = max_{i∈C1, j∈C2} d(i, j)

iv. Median method: In the median method (weighted centroid linkage), the centre of a newly merged cluster is taken as the midpoint between the centres of the two clusters being merged, and the distance between two clusters is the Euclidean distance between these centres.
v. Centroid method: Unlike the four methods above, the centroid method is derived from the distance between the centroid of the objects in one cluster and the centroid of the objects in another cluster:

d(C1, C2) = d(x̄_1, x̄_2)

where d is still the Euclidean distance, C1 and C2 are two clusters, x̄_1 is the centroid of the objects in C1 and x̄_2 is the centroid of the objects in C2. The centroid (mean or centre) of a cluster is computed as x̄ = (1/|C|) Σ_{i∈C} x_i.
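Both geometric methods are available in SciPy's `linkage`; note that they operate on cluster centres, so raw observations (with Euclidean distance) should be passed. The four points below are an invented toy example:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

# Two tight pairs of 2-D points, well separated from each other.
X = np.array([[0.0, 0.0],
              [0.0, 1.0],
              [4.0, 0.0],
              [4.0, 1.0]])

# 'centroid' merges clusters at the Euclidean distance between their
# centroids; 'median' uses the midpoint of the two merged centres
# instead of a size-weighted mean.
Z_cen = linkage(X, method="centroid")
Z_med = linkage(X, method="median")
print(Z_cen[-1, 2], Z_med[-1, 2])
```

Here both methods first merge each pair at height 1.0 and then join the resulting centres, (0, 0.5) and (4, 0.5), at height 4.0; centroid and median agree because the merged clusters have equal sizes.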

Hierarchical clustering with ratio-type metrics:
This study does not primarily focus on ratio-based methods. To make meaningful comparisons, only one specific ratio-type hierarchical clustering method was considered, known as relative hierarchical clustering, as presented by Mollineda & Vidal [9]. This method, which is classified as a ratio-based approach, was identified as the top performer for agglomerative hierarchical clustering in both real datasets (Pottery and Iris) studied by Roux [6]. The method introduced by Mollineda & Vidal [9] does not take into account only the dissimilarity between the two clusters being merged; it also factors in the distances between these clusters and the remaining clusters, in the denominator of the dissimilarity measure. This relative distance calculates dissimilarity as the distance between two objects divided by the minimum of the average distances of each object to the other clusters; this average is referred to as the isolation function:
f(A, B) = (1 / (NC − 2)) Σ_{X∈C, X≠A, X≠B} d(A, X)

where C is the set of clusters at the current step of the hierarchical clustering and NC is the number of clusters. The divisor is NC − 2 because d(A, A) = 0 and d(A, B) are not added in the summation, so subtracting 2 makes it the average distance to the remaining clusters. The isolation function is not symmetric: f(A, B) is not equal to f(B, A). The relative distance, however, is symmetric, since it takes the minimum of f(A, B) and f(B, A) in its denominator:
d_rel(A, B) = d(A, B) / min{f(A, B), f(B, A)}

The disadvantage of ratio-type methods is that branch crossing can occur in the dendrogram: a merge can take place at a higher relative distance than the next merge.
In distance-type methods, by contrast, there is no branch crossing in the dendrogram.
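A minimal sketch of this ratio-type dissimilarity, assuming a precomputed symmetric matrix of pairwise cluster distances (the 4x4 matrix below is invented for illustration, and `isolation` and `relative_distance` are hypothetical helper names, not functions from the cited paper):

```python
import numpy as np

def isolation(d, a, b):
    """Average distance from cluster a to all clusters other than a and b.

    d[a, a] = 0 and d[a, b] are excluded from the sum, hence the
    NC - 2 divisor described in the text.
    """
    nc = d.shape[0]
    others = [c for c in range(nc) if c not in (a, b)]
    return d[a, others].sum() / (nc - 2)

def relative_distance(d, a, b):
    """Ratio-type dissimilarity: pairwise distance over minimum isolation."""
    return d[a, b] / min(isolation(d, a, b), isolation(d, b, a))

# Toy 4-cluster distance matrix (symmetric, zero diagonal).
d = np.array([[0.0, 1.0, 4.0, 5.0],
              [1.0, 0.0, 3.0, 6.0],
              [4.0, 3.0, 0.0, 2.0],
              [5.0, 6.0, 2.0, 0.0]])
print(relative_distance(d, 0, 1))
```

Because the denominator takes the minimum of the two isolation values, d_rel(A, B) equals d_rel(B, A) even though f itself is asymmetric.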

Goodness of fit measures:
Two goodness-of-fit measures were considered in this analysis. Both rely on the ranking of values when comparing the distances between two objects with their ultrametric distances within the hierarchical clustering. For a quadruple, formed by two pairs (i, j) and (k, l), to be considered "concordant", the ordering of the distances must agree with the ordering of the ultrametric distances:

d(i, j) < d(k, l)  =>  u(i, j) < u(k, l)

If u(i, j) = u(k, l), the quadruple cannot be compared directly. This happens, for example, when cases 1 and 2 belong to the same cluster and are both connected to object 3 at the same hierarchical level, so their ultrametric distances to object 3 are identical (see Table 1). Among these four cases, only one discordant quadruple can be identified, namely (2, 3) and (2, 4): the distance between objects 2 and 3 is smaller than the distance between objects 2 and 4, yet the ultrametric distance between objects 2 and 4 is smaller than that between objects 2 and 3. Kendall's tau, a correlation coefficient, considers all 15 possible quadruples in its denominator, that is, all potential combinations for comparing the ranks.
The Goodman-Kruskal index, on the other hand, does not include non-comparable quadruples (where objects join at the same hierarchical level) or ties in its calculation. It focuses on the comparable cases, where meaningful comparisons can be made, and excludes those where ranking the objects is not feasible.
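The contrast between the two measures can be sketched as follows, assuming SciPy is available: the ultrametric distances come from `cophenet`, Kendall's tau (the tau-b variant, which accounts for ties) from `scipy.stats.kendalltau`, and the Goodman-Kruskal coefficient is computed by a hypothetical brute-force helper that simply skips tied quadruples:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, cophenet
from scipy.spatial.distance import pdist
from scipy.stats import kendalltau

def goodman_kruskal_gamma(x, y):
    """(concordant - discordant) / (concordant + discordant) over all
    pairs of pair-distances, excluding ties and non-comparable quadruples."""
    conc = disc = 0
    n = len(x)
    for a in range(n):
        for b in range(a + 1, n):
            s = (x[a] - x[b]) * (y[a] - y[b])
            if s > 0:
                conc += 1
            elif s < 0:
                disc += 1
    return (conc - disc) / (conc + disc)

rng = np.random.default_rng(0)
X = rng.normal(size=(10, 3))         # 10 invented cases, 3 variables
d = pdist(X)                         # pairwise Euclidean distances
u = cophenet(linkage(d, "average"))  # ultrametric distances from the dendrogram

tau, _ = kendalltau(d, u)
gamma = goodman_kruskal_gamma(d, u)
print(f"tau = {tau:.3f}, gamma = {gamma:.3f}")
```

Because the Goodman-Kruskal coefficient drops tied quadruples from its denominator while tau-b keeps them, it is never smaller than tau when concordant quadruples dominate.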

RESULTS
Hierarchical clustering was performed on the Pottery dataset using various distance-type methods, including average linkage, complete linkage, single linkage, median linkage, and the centroid method. The relative hierarchical clustering of Mollineda & Vidal [9] was also utilized as a ratio-type method. In total there are 45 cases in this dataset, resulting in 489,555 possible quadruples for comparison. These clustering methods were applied in conjunction with the various scaling techniques, and the Goodman-Kruskal measure of goodness of fit was calculated for each method combined with each scaling approach; the results can be found in Table 2. In a previous study by Roux [6], the mean absolute deviation was used for scaling this dataset, and it was observed that the Mollineda & Vidal [9] algorithm for relative hierarchical clustering yielded the highest Goodman-Kruskal measure (0.8066), indicating superior performance compared to the other methods. However, the data in Table 2 show that the median linkage method reaches a Goodman-Kruskal measure of 0.8072, surpassing the ratio-type method. Notably, across four out of the five scaling methods used for this dataset, median linkage consistently outperforms the other clustering techniques. Regarding the structure of the dendrogram, it is worth mentioning that when median absolute deviation is used for scaling, the highest Goodman-Kruskal value of 0.9233 is achieved with average linkage. Additionally, of the 489,555 quadruples examined, the highest number of concordant quadruples is observed with single linkage, while the lowest number of discordant quadruples is associated with average linkage.
Furthermore, when considering Kendall's tau, single linkage with median absolute deviation scaling achieves the highest tau measure, at 0.7191.
It is followed closely by median linkage with a tau value of 0.7180. In terms of the average Goodman-Kruskal measure, average linkage performs best, and the Mollineda & Vidal [9] method combined with median absolute deviation scaling is the next best option.
For the other scaling techniques, the median linkage method consistently achieves the highest Goodman-Kruskal measure. Also, in the results presented in Table 2, the various methods were evaluated by considering the diameter of the objects being connected. The diameter of the objects in C1 connected with the objects in C2 is defined as the maximum distance between any object in C1 and any object in C2. While the node level could be taken as the level of the dendrogram in each hierarchical clustering model, for comparison with ratio-type methods (which may involve branch crossing) and for consistency with the ultrametric distance calculation used by Roux [6], the diameter was used as the basis for evaluation in the other linkage methods as well. When the diameter was employed to evaluate the hierarchical clustering models, it became evident that the method proposed by Mollineda & Vidal [9] did not perform better than the median linkage method. When the node level itself is used for evaluation instead of the diameter, the distance-based methods likewise demonstrate superior performance compared to the ratio-type method.
In Table 3, the results of the goodness-of-fit assessment are presented using the node level of each model. For instance, the node level for the objects in C1 connected with the objects in C2 in the case of average linkage is defined as the average of the distances between the objects in C1 and the objects in C2.
Thus, in this context, the node level is used as an alternative to the diameter for evaluation. As shown in Table 3, when evaluating the clusters using the node level, the Goodman-Kruskal (GK) value for median linkage is the highest at 0.9287, particularly when median absolute deviation is employed for data scaling. The highest number of concordant quadruples is no longer observed for single linkage; instead, the method with the most concordant quadruples is median linkage with median absolute deviation scaling, at 367,514. This suggests that median linkage outperforms the other linkage methods for this dataset. Comparing the results in Table 3 to those in Table 2, when mean absolute deviation is used for data scaling, average linkage with GK = 0.8082 performs better than Mollineda & Vidal [9] with GK = 0.8066. Additionally, Kendall's tau also indicates that median linkage (tau = 0.7229) exhibits the best performance, followed by average linkage (tau = 0.7221). The GK value for Mollineda & Vidal [9] is also lower than in Table 2. This difference in the GK value is primarily due to the presence of branch crossings in ratio-type methods: using the node level itself to evaluate the method can reduce the goodness-of-fit measure. Among the models, Kendall's tau is highest for median linkage, with a tau value of 0.6597, followed by the centroid method with a tau value of 0.6508.

DISCUSSION
The focus of this paper was on hierarchical clustering, using various linkage methods to compare the outcomes of distance-based methods with the ratio-based method introduced by Mollineda & Vidal [9], which was identified as the most effective approach in a previous study conducted by Roux [6]. To facilitate this comparison, different scaling techniques were employed on the same real datasets that were used in Roux's earlier study.
The findings revealed that the results obtained from the distance-based methods were not inferior to those from the ratio-based method. Contrary to the results reported by Roux [6], the distance-based methods outperformed the ratio-based method in both real datasets. This conclusion was supported by both the Goodman-Kruskal and Kendall's tau measures of goodness of fit. For the Pottery dataset, the median absolute deviation yielded the best results for scaling, while for the Fisher Iris dataset the highest goodness-of-fit measures were obtained when no scaling was applied. Specifically, in the Pottery dataset, the median linkage method performed best, followed by the average linkage method. In the case of the Fisher Iris dataset, the median linkage method also showed the highest performance according to Kendall's tau, followed by the centroid method. In the evaluation of the clustering structures using the diameter as a criterion, the centroid method proved most effective with a Goodman-Kruskal value of 0.8696, followed closely by the median linkage method with a Goodman-Kruskal value of 0.8669. When the structures were assessed using the node level, the average linkage method delivered the best performance with a Goodman-Kruskal value of 0.8714, with the centroid method as runner-up at 0.8707.

CONCLUSION
In summary, the study demonstrated that the distance-based methods outperformed the ratio-based method proposed by Mollineda & Vidal [9] in both datasets. Additionally, it was found that the choice of data scaling method significantly influenced the hierarchical clustering results, and selecting an appropriate scaling method for each dataset led to more consistent clustering structures.

Conflicts of Interest:
The author declares that there are no conflicts of interest regarding the publication of this paper.

Table 1: Comparison of quadruples.
quadruple   i   j   k   l   d(i,j)   d(k,l)   u(i,j)   u(k,l)

Table 3. Hierarchical clustering using the Pottery dataset.
Table 4. Hierarchical clustering using the Fisher Iris dataset.

With larger sample sizes, the number of quadruples increases significantly, making the calculation of the goodness-of-fit measures more time-consuming. For this dataset, around 62 million quadruples need to be compared to calculate Kendall's tau and the Goodman-Kruskal measure. As displayed in Table 4, the highest Goodman-Kruskal index (GK) is observed for the dataset without any scaling, with GK = 0.8696 for the centroid method, followed by GK = 0.8669 for median linkage. Notably, when the data are scaled using the median absolute deviation method, the results from Mollineda & Vidal [9] outperform the other methods, achieving a GK of 0.7984. The highest number of concordant quadruples is found for single linkage, at 4441142. The lowest number of discordant quadruples is observed for the centroid method, with 3,042,850 discordant quadruples. Table 4 also evaluates the methods based on the diameter of the objects being connected. The highest Kendall's tau is achieved by median linkage, with tau = 0.6579, followed by the centroid method, with tau = 0.6500. When the focus shifts from the diameter to the node level for evaluating the hierarchical clustering methods, average linkage emerges as the best performer with a GK of 0.8714, followed by the centroid method with a GK of 0.8707. Median linkage shows the highest number of concordant quadruples at 44 million, while the lowest number of discordant quadruples is observed for average linkage, totalling 2.99 million.