Bryn Mawr Classical Review 03.06.19


Ledger on Keyser

One of the enduring problems of statistics is that data are never entirely uniform and homogeneous and that any set of objects, when measured, displays variation sometimes over quite a wide spectrum. It would not be unreasonable to claim that the entire science of statistics has been developed to deal with this problem, with the attempt to predict how wide the variation might be in any given circumstances and to calculate the chances of any one set of observations falling outside a certain range.

I make this point at the outset because K. (Paul Keyser, BMCR 3.1 [1992] 58-73) claims that the occurrence of outliers in my measurements of samples from Plato and other authors would have justified my abandonment of the project at the outset. His quarrel is really with statistics as a science in general, and perhaps more seriously, with a world which has variety in it. In fact there are no statistical measurements on any topic worth consideration that do not result in some contamination between supposedly distinct groups. For this reason it is common practice among statisticians to discard outliers as being unrepresentative of the group to which they belong. Even with uni-variate statistics, theory predicts that about 5% of measurements will fall more than two s.d.'s away from the norm, so it is hardly surprising, with multi-variate measurement and in a field as unpredictable and complex as writing, and at the level of detail that I chose to measure it, that overlap occurs. I would challenge any statistician to declare that the results of the cluster analyses shown on pages 45 and 52-3 (G.R. Ledger, Re-counting Plato [Oxford 1989]) are of no significance and do not give "a sufficiently sensitive measure even of gross divergence in author and genre" (p. 424). My own reaction, looking at the tree diagrams again after an interval of two years, is that they are stunningly accurate in separating the samples into the appropriate groups and that I erred in failing to be sufficiently enthusiastic at the results of these preliminary findings. No other investigations in stylometry have come anywhere near these in showing that at small sample level many works display both homogeneity and distinctiveness, or that we have potentially, by the measurement of a few rudimentary variables, the possibility of constructing a profile for each and every author, even though it may not in every instance enable us to distinguish works infallibly. (What profile or face ever was unique or did not sometimes result in mis-recognition?) The difficulties of interpretation of my results arise because the variables I have chosen reveal too much about style, not too little.
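
The two-s.d. figure cited above can be checked with a short calculation. The following is a minimal sketch in Python, assuming normally distributed measurements (an assumption the argument implies but does not state); the function name is illustrative only.

```python
import math

def two_sided_tail(k: float) -> float:
    """Probability that a standard normal value lies more than k s.d.'s from the mean.

    Assumes a normal distribution; real stylometric variables may not follow one.
    """
    return math.erfc(k / math.sqrt(2.0))

share = two_sided_tail(2.0)
print(f"{share:.4f}")  # prints 0.0455
```

Under that assumption the expected share is a little under 5%, or roughly one measurement in twenty-two, for each variable taken singly.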

As regards K.'s criticism of my chronological placement of the dialogues, it is important to realise that the statistic used, the first canonical variable, is a multi-variate statistic, not a uni-variate one. There are two consequences of this. Firstly, the assumption of linearity is applied in a multi-dimensional world, not a two-dimensional one, as in the usual scenario. It may be wrong, but it is inherently more likely, in relation to complex measurements of this nature, that two works close in style (i.e., close in multi-dimensional space) are also close in date. (See also p. 176.) Why after all does the Laws always follow the Republic, and Epinomis always follow the Laws (barring one instance) according to the variables I have chosen? It can hardly be an artifact of the method used, since Epinomis is not included in any of the sets of dialogues used to choose the time-related variables, while the Republic and Laws are omitted from sets B and D (p. 179).

Secondly, the introduction of a uni-variate test statistic, which K. advocates (p. 426), as a check on the reliability of the average CAN1 scores which I use for placing the dialogues, is not appropriate. Details of the type of test which should be used are given on p. 65. Such tests give, broadly speaking, the probability that the correlation between CAN1 and group membership is spurious. In most cases this probability is close to zero, but it does not follow that one should therefore slavishly adhere to the results of any one analysis on the basis of a significance test. With MVA it is nearly always possible to show a significant difference between any two works, but I would imagine that most researchers in the humanities remain unconvinced that the statisticians' definition of significance is relevant to the decision they have to make -- whether one work is later than another or which author is most likely to have written the work under scrutiny. I make no secret of disagreeing with the tradition of inferential statistics used in stylometry, and prefer to give a descriptive account of the data (p. 35).

Nevertheless, I do not think such an approach invalidates my interpretation of the chronological listings. It is surely perverse to suggest that I should not have announced such details as that Tim. and Criti. follow Laws when 99% of my listings show this to be the case. Of course some of the lists are partially dependent, and even with an infinite set of variables independence could not be guaranteed, but there is nothing in sets A, C or D which could guarantee that Tim. would be diagnosed as a late dialogue, and nothing in any of the lists which could force Criti. artificially into its frequent final position. This is not the consistency of error, as K. implies, but a fair and objective assessment of the data. I frequently refer to the fact that stylometry is no better than its variables and that my results are not infallible. But the fact is that they are better than anything that has been attempted before or since, chiefly because they do not use gross statistics (i.e., statistics averaged over entire works) which, by definition, assume a homogeneity which may not exist; they bring other authors under scrutiny alongside Plato; the variables chosen are arbitrary and not coloured by ideological considerations (e.g., what distinguishes Plato's late style); they attempt a moderate interpretation of the data without preference for one result over another. It does not matter to me greatly in what order Plato composed his works. I would cheerfully concur with the finding that he wrote Timaeus in his teens, but the statistical evidence suggests otherwise and it would have been folly on my part to fail to report it or to abandon the enquiry because of the possibility of error. If error does exist it is not on the scale suggested by K. and not for the reasons he mentions.

By all means let us have a discussion on methodological issues, but I cannot consider it to be an auspicious start when arcane references masquerade as knowledge and spurious statistical concepts are advanced as evidence of serious and corrosive flaws.

Gerard Ledger
University of Reading

Keyser on Nails on Keyser on Ledger

Discussion clarifies, and I welcome the opportunity to respond to D. Nails' (hereafter N.) views on stylometry (BMCR 3 [1992] 314-27). Let me open by emphasizing that we appear to agree about the value and importance of Thesleff's work (pp. 321-7). Whichever view one takes of Ledger's work, I join N. in encouraging scholars to read Thesleff's work carefully.

As to L., however, we disagree. I respond to seven details and then turn to the one large and serious issue.

First, I insist again that L. ought to have noted that previous workers in stylometry had considered orthography (i.e. counted letters in some way). It is true that L. seems to have been the first to apply this to Plato, but earlier work would have provided needed guidance. Moreover N. does not remark on the important part of this criticism -- L.'s choices are weak. To restate and correct a misprint: psi, xi, and even zeta could have been counted with sigma, the aspirates theta, phi, and chi with their psilotes, and he ought to have counted initial letters, prominent and inflected. The earlier work in counting letters is not irrelevant, as consultation of the works cited would show (N. 316-7 versus K. 422-3).

Second, I noted that L.'s discussion of the uncertainty is inadequate because he almost never gives the sigma (standard deviation) associated with any numerical value he computes, and that therefore one cannot know how likely it is that a difference between a pair of such values is significant. N. (p. 321 n. 1) advises a look in L.'s index under "standard deviation" (equivalent to sigma) and "significance" (a term I did use). We find L. pp. 35, 38, 176, 182-3, and 184-5 only. And it is still true as I said (pp. 466-7) that the only numbers given are on p. 185 and are inadequate. N. notes that significance tests are routine in SAS (the software package used by L.) -- why then did he not report sigma's or z's?
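
Why the sigmas matter can be shown with a small worked example. The sketch below is in Python and assumes two independently estimated values with known standard deviations and roughly normal errors; the numbers and function names are invented for illustration and are not taken from L.'s tables.

```python
import math

def z_of_difference(a: float, sigma_a: float, b: float, sigma_b: float) -> float:
    """Difference a - b in units of its own standard deviation.

    Assumes the two estimates are independent, so their variances add.
    """
    return (a - b) / math.hypot(sigma_a, sigma_b)

def two_sided_p(z: float) -> float:
    """Two-sided tail probability for z under a standard normal distribution."""
    return math.erfc(abs(z) / math.sqrt(2.0))

# The same gap of 0.5 between two values, with small and with large sigmas.
z_tight = z_of_difference(1.0, 0.1, 0.5, 0.1)
z_loose = z_of_difference(1.0, 0.5, 0.5, 0.5)

print(f"sigma 0.1: z = {z_tight:.2f}, p = {two_sided_p(z_tight):.4f}")
print(f"sigma 0.5: z = {z_loose:.2f}, p = {two_sided_p(z_loose):.4f}")
```

With the hypothetical small sigmas the gap is well beyond two standard deviations; with the large ones it is well within one. Without the sigmas a reader cannot tell which situation applies.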

Third, although L.'s choice of variables has been criticised, and although his treatment of the uncertainty fails, it would still be of help for future workers to know which variables he found most useful. In the test attempting discrimination of the chronology of the dialogues (K. pp. 425-6) L. specifies in tables that the 'best' variables were primarily final letters, esp. eta, iota, nu, omega. In the test attempting discrimination of author (Plato, Aeschines, Thucydides, etc., K. pp. 424-5) he tells us that the 'best' results were obtained with all nine possible final letters alpha, epsilon, eta, iota, omicron, upsilon, omega, nu, and sigma plus the variables representing words containing one of alpha, gamma, delta, epsilon, eta, theta, iota, kappa, lambda, mu. My complaint, of which N. states that it "borders on the absurd" (p. 319 n.1), was that for the test attempting discrimination of authenticity within 'Plato', a chapter amounting to just about 1/3 of his text, L. fails to tell us which variables were 'best' (p. 425). This is of no little importance (L. makes rather specific and confident proposals), and not a bit absurd.

Fourth, N. complains that I "niggled" over which book of certain authors was used in the author discrimination tests (p. 318 n. 1). There were two such questions (p. 424), and it is true that with some flipping back and forth the answers can be found. Four of ten Thucydides samples used cluster with one from Aeschines (L. p. 47): L. has indicated (p. 41) that he has 49 Thuc. samples (comprising Books 3-5), that (p. 46) he used 10 of 16 [sic] samples, i.e., probably from Book 3 (as on p. 41 it had 16 samples), and in Fig. 6.1 (p. 45) the labels show that Thuc. samples 306, 309, 302, and 308 cluster with Aeschines Tim. (sample 2), while Thuc. sample 301, 303, 304, 305, and 307 are isolated. The other case is similar. Given all this, asking 'which' is no niggle: L. could have been clearer.

N. has misunderstood my point concerning non-linear effects (K. p. 425, N. p. 316, L. pp. 173-6). To take a hypothetical case, let us say that Plato avoided hiatus mildly when young, stopped bothering at all in his middle years, and then strongly avoided it in old age, with more or less smooth changes between. Of course, since we are in principle not in possession of the dates, only the hiatus (or whatever) data, what we would 'see' is a more or less smooth change between low and high rates of hiatus -- and we would not even know there was a non-linear change hiding in the data. Discovering such a non-linear effect in several variables (e.g., hiatus, plus rate of use of final nu, plus etc.) is even less likely -- such a dataset would rather seem muddled, giving different chronologies depending on which variables were used.
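
The hypothetical case can be put in numbers. The sketch below is in Python; the six works and their hiatus rates are invented purely for illustration (moderate when young, highest in the middle years, lowest in old age) and correspond to no real measurements.

```python
# Hypothetical works listed in their true order of composition.
true_order = ["W1", "W2", "W3", "W4", "W5", "W6"]

# Invented hiatus rates: moderate at first, highest in mid-career, lowest at the end.
hiatus = {"W1": 0.30, "W2": 0.40, "W3": 0.50, "W4": 0.35, "W5": 0.20, "W6": 0.10}

# Without dates, all one can do is order the works by the measured rate.
inferred = sorted(true_order, key=lambda w: hiatus[w], reverse=True)

print(inferred)                # ['W3', 'W2', 'W4', 'W1', 'W5', 'W6']
print(inferred == true_order)  # False
```

The inferred sequence runs smoothly from high to low rates and so looks like a plausible development, yet it places the mid-career work first. Nothing in the rates alone signals that the true course was non-linear.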

N. defends L.'s use of "three forms of cluster analysis to identify which segments [of text] belonged to which authors" (p. 317, vs. K. p. 424, L. p. 33). N. claims that L. "explained their advantages and biases for different types of data sets," which he does briefly (pp. 33-4). Our disagreement here may be clarified by recourse to an analogy. Let three different standardized multiple-choice IQ tests be given to a large population; the results show that persons of gender A have higher IQ than those of gender B. People who defend the results claim that since three different tests were used the results are more secure; those who attack the results claim that since all the tests were standardized multiple-choice they will have a common bias. All three of L.'s methods use a distance measure equivalent to that quoted in my review (Euclidean distance or its square). I stand by my claim, which was neither clumsy nor a distortion (a summary, I grant).
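
One narrow sense in which Euclidean distance and its square are equivalent can be sketched as follows. The example is in Python; the three samples and their two coordinates are invented for illustration, and the sketch shows only that both measures rank pairs of samples identically, not that different clustering methods must produce identical trees.

```python
import math
from itertools import combinations

# Invented two-variable profiles for three hypothetical text samples.
samples = {"A1": (0.12, 0.30), "A2": (0.14, 0.28), "B1": (0.40, 0.10)}

def euclid(p, q):
    """Ordinary Euclidean distance between two profiles."""
    return math.dist(p, q)

def sq_euclid(p, q):
    """Squared Euclidean distance between two profiles."""
    return sum((x - y) ** 2 for x, y in zip(p, q))

pairs = list(combinations(sorted(samples), 2))

# Rank the pairs from nearest to farthest under each measure.
rank_plain = sorted(pairs, key=lambda pr: euclid(samples[pr[0]], samples[pr[1]]))
rank_square = sorted(pairs, key=lambda pr: sq_euclid(samples[pr[0]], samples[pr[1]]))

print(rank_plain == rank_square)  # True
```

Because squaring preserves the order of non-negative numbers, any decision that depends only on which pair is nearest comes out the same under either measure.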

Sixth, N. disagrees with my analysis of L.'s success in discriminating known authorship (N. pp. 317-8 versus K. pp. 424-5). I attacked these indeed precisely because they should have been "intended to provide evidence of some achievement," as even N. has just stated (p. 317): L. "well knew ... that it would be necessary first to establish the reliability of various types of multi-variate analysis by testing them on questions the answers to which he already knew." Moreover, as I will discuss further below, L. never did check whether his methods could discriminate chronology in a known case. L. is checking to see whether cluster analysis (cp. above) can be used with his chosen variables to discriminate author. Since each work he examined consisted of 4 to 16 samples (pp. 46, 54 = tables 6.2, 6.3), treated equally, perfect success would require that samples from different authors are always distinguished, while samples from any work of a given author are always clustered together. L. discusses why he prefers to calculate success based on percentage of samples classified correctly (pp. 48-9) and to base attribution on a "majority verdict" (p. 56), though he admits that "precisely how one should establish success in these circumstances is not at all clear" (p. 48). The computer of course does not name names but only classifies, so we must ask, in the case where the answer was not known, how much error could we tolerate? In the first case examined by L. four Thuc. samples (from somewhere in Hist. 3) cluster with one of ten Aeschines samples (from Tim.). If instead of seven works whose authors are known to be distinct (as here), we had seven works which might be by one to seven authors, this would be grounds for attributing Tim. and Hist. 3.x to the same author. Similarly, since all five samples from Erat. cluster with one from Hist. 3.x, perhaps we ought to assign that speech also to the corpus of the 'Auctor in Timocharem'. Since these 'strays' from Thuc. are not distinct from their cluster until a level in the tree at which individual works begin to break up, we would be justified in calling these three documents 'partly homogeneous'. Thus I think three clusters of ten showing contamination is bad. I do not render that into a percentage because I do not think that percentage would be meaningful (contra N. p. 318, n. 1). The second example, including more samples and more works, was indeed introduced "to illustrate more effectively the difficulties met with in separating works from one another and the way in which the more advanced techniques of MVA deal with the problem" (L. p. 51), so an increase in error (by N.'s own accounting) is not encouraging. The third example is the one in which N. sees 99% accuracy -- I note that Hist. 3.x and Erat. are still 'attributed' to the same author by this test. N. fails to note the fourth test, in which Xen. and Plato were 'shown' to be the same author (K. p. 425).

The seventh and the last of the minor issues is the disagreement over L.'s report of his chronological results (N. pp. 319-20 versus K. pp. 425-7). I conflated "early-or-late" with "before-or-after" (a category invoked by N., p. 320, to refer to L.'s treatment of the external evidence and cross-references, pp. 82-6) because one has to -- L.'s method proceeds by assuming that (at least) Laws postdates Rep., and without some such assumption no before-or-after distinction could ever be made (so L. p. 88, n. 74). Certainly L. is not sloppy, but only records what he did, in Table 9.2 (N. p. 319, n. 3). L. discusses "temporal proximity" only in connection with this confusion over the non-linearity problem (pp. 175-7), so naturally I avoided introducing this erroneous point. It seems to me, though one cannot be sure from his presentation of his results, that L.'s results may agree with previous work in grouping certain dialogues in a late set (though note how many were assumed late, K. p. 426). But given their uncertainty it is surely too much to report Tim. and Crit. as post Laws (L. pp. 197, 200-5), a point on which L. expresses no hesitation: "the [stylometric] evidence nearly all points in the opposite direction [to the view of Owen], suggesting that not only are the Timaeus and Critias late dialogues, but that they were the last productions of Plato's pen, Critias being cut short by Plato's death" (L. p. 197), a view on which he insists at length (pp. 200-5). This is short of the announcement I called it, by an insignificant hair. Finally, L.'s presentation of the order of the middle group is mistaken not because he thereby ignores possible alternative arguments (N. p. 319, n. 3), rather it is of a piece with his failure to examine the uncertainties (above) -- it is misleading excess precision. N. in the end agrees (p. 321): "I remain unconvinced that stylometric analysis is the fine instrument L. requires for so detailed a linear sequence of dialogues."

The major point I have stated in my review of Brandwood, but it seems to need repeating, for N. underestimates its importance (p. 317). In the natural sciences, theories and methods are tested by deriving predictions from them about how things will occur in a given test or experiment. In philology we cannot repeat tests in the same way -- we cannot ask Plato to write more dialogues. Instead, in order to be as sure of our results as we can, we must take such data (texts and facts about them) as we have, and try different tests on them. But if the tests are to reveal chronology or authenticity in disputed cases, they must be well and widely tested on works of known author and date. L.'s letter-counting must be checked on various authors, Greek and Latin (with suitable alterations to the method), to see whether in fact it does tell us anything about chronology. L. only checked to see if it would discriminate authorship, and I believe he showed that it failed (or at least did not reliably work). But since even the design is flawed (above), it ought to be checked with an improved design -- that is, count initial letters, breathings, iota adscripts (or subscripts), sibilants with sigma, aspirates with their psilotes. Only when the design has been refined through several iterations, and the test checked with several authors to confirm that it (for some subset of the letters counted) is correlated with chronology, would it be valid even in principle to apply it to Plato. This is the summary of my review article (in which L. originally held a place as the latest work): no stylometric work, however careful and proper in design and execution, can ever hope to be correct without these extensive control tests.

Paul Keyser
University of Alberta