Statistical Powers of Some Tests for Checking Homogeneity of Survival Distributions with Disjointed Ends in the Presence of Censoring

This study compared several tests for assessing the overall homogeneity of Kaplan-Meier survival curves under low and high censoring rates when the curves are disjointed towards the end. The performance of each test was measured by its statistical power. A Monte Carlo simulation study was conducted to evaluate and numerically compare the relative performances of the Log-rank, Wilcoxon, Tarone-Ware, Peto-Peto, Modified Peto-Peto, Fleming-Harrington (1,1), and Babalola-Adeleke tests. The results show that the Babalola-Adeleke and Fleming-Harrington (1,1) tests perform more robustly than the other five popular tests, with relatively high power in detecting differences under both low and high censoring rates. The highest overall average powers under low and high censoring rates were produced by the Babalola-Adeleke and Fleming-Harrington (1,1) tests, respectively. Hence, these two tests are the most suitable for assessing the homogeneity of survival curves under these conditions.


Introduction
Survival analysis is advancing and gaining popularity rapidly across many fields of study. The nature of the data obtained in Biostatistics has driven the growth in the volume of work on survival analysis [1][2][3][4][5]. Survival analysis is also widely used in Engineering and the Social Sciences [6][7][8]. A predominant method in survival analysis is the Kaplan-Meier method, which can estimate the survivorship function for different sample sizes. Several scholars have established its efficiency in capturing the necessary survival details in cohort studies and other designs. The Kaplan-Meier estimator is a nonparametric method that incorporates censoring when estimating survival probabilities [9][10][11][12]. Other related and relevant research works have also been reported in the literature.
The log-rank test is arguably the most popular test of homogeneity of survival distributions. However, it may fail to recognize some crucial differences among groups when the main difference occurs very early in the study or towards the end of the study [13]. This is because it was designed to give equal weight to all failures throughout the follow-up [14]. The shortfall of the log-rank test is in the assumption that the hazard ratio of the groups should be proportional along the follow-up period as that is the only condition that makes the test superior by using various combinations of censoring proportions, respectively. In the paper, the Wilcoxon test had the lowest relative power of all tests examined. [22,23] and [24] compared the Wilcoxon and the Log-rank tests under different scenarios. [25] added the Tarone-Ware, Peto-Peto, and F-H tests to the comparison of the log-rank and Wilcoxon tests when the sample size is quite small. It was concluded in the paper that the choice of weight function has a tremendous impact on the power of the tests under any given situation. The importance of simulations and Monte Carlo methods in modern research was the focus of [26]. [27] proposed a modified one-sample log-rank test and derived a sample size formula based on its exact variance to provide a study design that preserves the type I error. [28] discussed versatile tests for comparing survival curves based on weighted log-rank statistics. [29] proposed a nonparametric test for the comparison of survival curves using the median. [30] examined tests for comparing survival curves with right-censored data. In that study, the type I error rate of the Log-rank test was equal or close to the nominal value.
[31] developed a new method and demonstrated that it outperformed some existing methods under low and high censoring rates when the Kaplan-Meier survival curves are proportional. It was also found that when the survival curves cross, the powers of the tests are relatively low, since none of the tests gave a statistical power close to one.
Thus, this paper considers a typical situation in which the survival curves of the two groups are similar at the beginning of the study but gradually diverge towards the end. The censoring rates were categorized into two levels (low and high). The censoring times in the groups were carefully chosen to fit the intended survival pattern. All survival times were simulated from an exponential distribution. The outcome of this study will give researchers further guidance on their choice of test when survival curves are disjointed towards the end.
Hence, the novelty of this study lies in comparing the relatively new Babalola-Adeleke test with some of the popular methods for checking homogeneity of Kaplan-Meier survival curves with disjointed ends under both high and low censoring rates. The findings are expected to help users of survival analysis by further exposing the performances of the tests under consideration, and to guide the choice of the most appropriate test for detecting differences in survival curves with disjointed ends. To the best of our knowledge, this is the first study to compare the Babalola-Adeleke test with others under this particular situation.

Methodology
Consider two groups, groups 1 and 2, in which the survival times were observed and recorded as $t_j$. The number of observed failures (deaths) in group 1 and group 2 being     Based on the argument above, the test hypothesis considered is: For the test statistics of the tests, see [39][40][41][42] and [8]. The tests rest on the following assumptions: censoring is unrelated to prognosis; the survival probabilities are equal for subjects recruited early and late in the study; and the events happened at the times specified. Fleming-Harrington (1,1) was selected because it places the most weight on failures in the middle of the follow-up, whereas every other test either weights all failures equally or places more weight at the beginning or towards the end. Figure 1 shows the survival curves of two groups that have a similar pattern for some time but have a disjointed end. All the simulated datasets followed this pattern.
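To make the role of the weights concrete, the following is a minimal sketch of a two-sample weighted log-rank statistic in Python. It is not the authors' implementation: the function name `weighted_logrank` and its arguments are hypothetical, the weights are the standard textbook forms for the Log-rank, Wilcoxon (Gehan), Tarone-Ware, Peto-Peto, and Fleming-Harrington (p, q) tests, and the Babalola-Adeleke test is omitted because its statistic is given only in the cited references.

```python
import math
import numpy as np

def weighted_logrank(time, event, group, weight="logrank", p=1.0, q=1.0):
    """Two-sample weighted log-rank chi-square statistic (1 df) and p-value.

    time   : observed times (failure or censoring)
    event  : 1 = failure observed, 0 = censored
    group  : 0/1 group labels
    weight : 'logrank', 'wilcoxon', 'tarone-ware', 'peto-peto' or 'fh'
    p, q   : Fleming-Harrington exponents, used only when weight == 'fh'
    """
    time = np.asarray(time, dtype=float)
    event = np.asarray(event, dtype=int)
    group = np.asarray(group, dtype=int)

    num, var = 0.0, 0.0
    s_km = 1.0  # pooled Kaplan-Meier estimate just before the current time
    s_pp = 1.0  # Peto-Peto modified survival estimate

    for t in np.unique(time[event == 1]):
        at_risk = time >= t
        fails = (time == t) & (event == 1)
        n = int(at_risk.sum())                  # at risk, both groups
        n1 = int((at_risk & (group == 1)).sum())  # at risk, group labelled 1
        d = int(fails.sum())                    # failures, both groups
        d1 = int((fails & (group == 1)).sum())    # failures, group labelled 1

        if weight == "logrank":
            w = 1.0                             # equal weight throughout
        elif weight == "wilcoxon":
            w = float(n)                        # emphasises early failures
        elif weight == "tarone-ware":
            w = math.sqrt(n)
        elif weight == "peto-peto":
            s_pp *= 1.0 - d / (n + 1.0)
            w = s_pp
        elif weight == "fh":
            w = s_km ** p * (1.0 - s_km) ** q   # (1,1) peaks mid follow-up
        else:
            raise ValueError(f"unknown weight: {weight}")

        num += w * (d1 - d * n1 / n)            # weighted observed - expected
        if n > 1:                               # hypergeometric variance term
            var += w ** 2 * d * (n1 / n) * (1.0 - n1 / n) * (n - d) / (n - 1.0)
        s_km *= 1.0 - d / n                     # update after use

    if var == 0.0:
        return 0.0, 1.0
    chi2 = num ** 2 / var
    p_value = math.erfc(math.sqrt(chi2 / 2.0))  # upper tail of chi-square, 1 df
    return chi2, p_value
```

With `weight="fh"` and `p = q = 1`, the weight is zero at the first failure time and largest when the pooled survival estimate is near 0.5, which is the sense in which Fleming-Harrington (1,1) emphasises the middle of the follow-up.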

Results
In the sub-situation with low censoring rates in both groups, the survival times in Group 1 follow an exponential distribution with a mean of 4 (rate 0.25), and the survival times in Group 2 follow an exponential distribution with a mean of 4 (rate 0.25) as well. To obtain survival curves that separate towards the end, any survival time in Group 2 that is greater than or equal to 4 is instead simulated from an exponential distribution with a mean of 40 (rate 0.025). To keep the censoring rates low in the two groups, an observation was censored if its survival time was greater than the maximum survival time divided by 1.25, in both groups. This yielded overall average censoring rates of 4.50% and 9.99% in Groups 1 and 2, respectively. Table 3 displays the powers of the seven tests obtained from the simulation for this sub-situation under low censoring rates, alongside the censoring rates. The censoring rates in both groups decrease as the sample sizes increase. The same trend is also exhibited with mixed sample sizes.  Table 3 is given in Table 4.    Table 5. Unlike the first sub-situation with low censoring rates, the censoring rates in both groups increase with sample size. Generally, the powers of all the tests are low. Even so, the Fleming-Harrington test still outperforms the other tests. As expected, the powers increase as the sample sizes increase. This suggests that, at much larger sample sizes, the powers of the tests could attain higher values than those reported.
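The data-generating scheme for the low-censoring sub-situation can be sketched as follows. This is an illustrative reading of the description, not the authors' code. The function names and the number of replicates `n_rep` are hypothetical. The sketch also assumes that the maximum survival time is taken within each group, and that a censored observation is recorded at the cutoff (maximum divided by 1.25). The text does not state either detail.

```python
import numpy as np

def simulate_disjoint_end(n1, n2, rng, mean_early=4.0, mean_late=40.0,
                          change_point=4.0, divisor=1.25):
    """Simulate one two-group data set whose curves separate late."""
    # Both groups start from an exponential distribution with mean 4 (rate 0.25).
    t1 = rng.exponential(scale=mean_early, size=n1)
    t2 = rng.exponential(scale=mean_early, size=n2)

    # Group 2 times >= 4 are redrawn from an exponential with mean 40 (rate 0.025).
    late = t2 >= change_point
    t2[late] = rng.exponential(scale=mean_late, size=int(late.sum()))

    times, events = [], []
    for t in (t1, t2):
        cutoff = t.max() / divisor                 # assumption: per-group maximum
        events.append((t <= cutoff).astype(int))   # 0 = censored
        times.append(np.minimum(t, cutoff))        # assumption: censored at cutoff
    time = np.concatenate(times)
    event = np.concatenate(events)
    group = np.concatenate([np.zeros(n1, dtype=int), np.ones(n2, dtype=int)])
    return time, event, group

def estimate_power(pvalue_fn, n1, n2, n_rep=1000, alpha=0.05, seed=0):
    """Monte Carlo power: share of replicates in which pvalue_fn rejects."""
    rng = np.random.default_rng(seed)
    rejections = 0
    for _ in range(n_rep):
        time, event, group = simulate_disjoint_end(n1, n2, rng)
        if pvalue_fn(time, event, group) < alpha:
            rejections += 1
    return rejections / n_rep

if __name__ == "__main__":
    rng = np.random.default_rng(2024)
    time, event, group = simulate_disjoint_end(50, 50, rng)
    for g in (0, 1):
        rate = 1.0 - event[group == g].mean()
        print(f"Group {g + 1} censoring rate: {rate:.2%}")
```

Here `pvalue_fn` stands for any function that takes `(time, event, group)` and returns a p-value, for example a log-rank or Fleming-Harrington (1,1) test from a survival analysis library. Because the data are generated with a true difference between the groups, the rejection proportion returned by `estimate_power` estimates the power of that test at the chosen sample sizes.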

Application of the tests to real-life data
Survival in patients with Acute Myelogenous Leukemia (AML) was studied to assess the impact of extending the standard course of chemotherapy [43,44]. The variables in the study were "time", the survival or censoring time, and "status", which indicates the event (recurrence of AML): 1 = event (recurrence) and 0 = no event (censored). The treatment group was represented by the variable "x", which indicates whether maintenance chemotherapy was given (Maintained) or not (Non-maintained).
This is a popular data set with 8.33% of patients censored in group 1 (maintained) and 36.36% in the second group (non-maintained). The data set is only "slightly" similar to the situation under study: the survival curves follow a similar pattern from the beginning of the study until about week 45, though not exactly the same form from the beginning. The homogeneity of the survival curves can then be investigated. This is the closest real-life data set at our disposal for the situation under study.
The test hypothesis is:  Table 7 shows that all the tests indicate that the Kaplan-Meier survival curves of those who were maintained and those who were not maintained are not significantly different, as none of the p-values is less than 0.05. All the tests yielded very low chi-squared values. This result is consistent with the results reported earlier.

Conclusion
Generally, the powers of all the tests are low. Even so, the Fleming-Harrington test still outperforms the other tests. The powers increase as the sample sizes increase, which suggests that at much larger sample sizes the powers of the tests could attain higher values than those reported. In general, when the survival curves separate only towards the end, the powers of the tests are low, as expected. It is difficult for the tests to detect a difference between the survival curves because the curves are similar for most of the study and diverge only towards the end. The low power values are expected and have been reported by other researchers as well. Across all the sample sizes, the overall average power of all the tests combined is lower under high censoring rates (0.0874) than under low censoring rates (0.1756).
Authors Conflicts of Interest: The authors declare that there are no conflicts of interest regarding the publication of this paper.