Ledger’s is the latest contribution in a long series of attempts to determine the chronology of Plato’s works by means of the relatively modern technique of stylometry—that is, one takes numerical measures of the manner of writing (‘style’ sensu inferiore), and in various ways converts these measures to a relative chronology (authorship has also been treated by this method). Ledger proposes to depart “from the traditional approach of stylometry by ignoring entirely meanings and grammatical functions, measuring instead the frequencies of words according to their orthographic content” (p. 2), and places most of his effort on determining the chronology of Plato (he does touch on authenticity). He seeks variables to count which will be easy to count by computer, are frequent, and are likely to relate to (i.e., be sensitive to) style (p. 4); words are rejected as insensitive if frequent, and rare if sensitive (p. 5); while orthography ought to pick out style well in a highly inflected language (p. 6). He counts words containing a specified letter α to ω, lumping the rare letters β, ζ, ψ, φ, χ, and θ together to get a larger count (Why not count φ and χ, and even ζ, with σ, or count them as double letters? Why not φ and χ with π and κ respectively?); he counts words ending with a specified letter (αεηιουω, ν and σ) and he counts words whose penultimate letter is a specified letter (αεηιουω, δ and τ). (Why not the initial letters of words, prominent and inflected?) These sets make up 37 variables (pp. 6-9). However he ignores all accents and breathings due to computational difficulty (p. 11)—but the spiritus asper and the iota adscript (or subscript) at least ought to have been counted (and he exaggerates the computational difficulty). He sampled the texts, whether of Plato or others, in 1000-word units (p. 5), in order to allow for the determination of variation within a work (p. 16), but these samples were so taken that they in fact covered most of the texts (p. 18). 
Although he considered checking for the effect of genre, he never did (p. 15).
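To make the counting scheme concrete, here is a minimal Python sketch of the 37 variables as described above: 19 "word contains letter" counts (18 single letters plus one lumped group for β, ζ, ψ, φ, χ, θ), 9 final-letter counts and 9 penultimate-letter counts. The function and key names are hypothetical, the grouping follows this review's summary rather than Ledger's own program, and accents, breathings and iota subscript are simply discarded, as Ledger does.

```python
import unicodedata

LETTERS = "αβγδεζηθικλμνξοπρστυφχψω"
RARE = set("βζψφχθ")   # lumped into one variable (as summarised in the review)
FINAL = "αεηιουων" + "σ"   # word-final letters that are counted
PENULT = "αεηιουω" + "δτ"  # penultimate letters that are counted


def strip_marks(word):
    """Lower-case a word and drop accents, breathings and iota subscript."""
    decomposed = unicodedata.normalize("NFD", word.lower())
    bare = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    bare = bare.replace("ς", "σ")  # treat final sigma as sigma
    return "".join(ch for ch in bare if ch in LETTERS)  # drop punctuation


def letter_variables(words):
    """Return the 37 counts for one sample (a list of Greek words)."""
    keys = [f"has_{c}" for c in LETTERS if c not in RARE] + ["has_rare"]
    keys += [f"ends_{c}" for c in FINAL] + [f"penult_{c}" for c in PENULT]
    counts = dict.fromkeys(keys, 0)
    for raw in words:
        w = strip_marks(raw)
        if not w:
            continue
        present = set(w)
        for c in present - RARE:
            counts[f"has_{c}"] += 1      # one count per word, not per letter
        if present & RARE:
            counts["has_rare"] += 1
        if w[-1] in FINAL:
            counts[f"ends_{w[-1]}"] += 1
        if len(w) >= 2 and w[-2] in PENULT:
            counts[f"penult_{w[-2]}"] += 1
    return counts


sample = "τὸν λόγον καὶ τὴν ψυχήν".split()
variables = letter_variables(sample)
```

In a real run each sample would be a 1000-word unit, and the counts would presumably be used as they stand, since the sample size is fixed.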
So much for the preliminary methodology—hardly perfect, but interesting. The claim on the endpaper (“a revolutionary new approach … taking as the set of variables … the occurrence of particular letters”) is hardly justified, though Ledger only claims that “since style is such a nebulous concept … [it is] better, in fact, to count too much than to miss important facts through a harsh selection process” (p. 6) and that previous workers thought orthography useless “for detecting differences of style” (p. 49). It is important to establish the history correctly since Ledger does not. As Yule pointed out long ago, the medieval Masoretes (VIIII A.D.) were very fond of letter counting in the Hebrew Tenach,[1] and I would add that Lasos of Hermione in the VI B.C. and other Greeks wrote lipograms, at least one of which has been found (the asigmatic V or IIII B.C. work in P. Bodm. XXVIII).[2] Moreover, since Shannon’s development of a mathematical theory of communication (in 1948), based on letter-frequencies,[3] a large number of people have counted letters in various languages in various ways,[4] in some cases for the purpose of establishing or confirming chronology,[5] in others to establish authorship[6], or even to determine linguistic affinities (the assumption being that authors unconsciously seek or avoid certain sounds, i.e., letters).[7]
Ledger’s second step seems very wise. He selects some works of seven Attic prose writers of V to IIII B.C. (pp. 20-5): Aeschines, Isaeus, Isocrates, Lysias, Thucydides, Xenophon, and Plato. He then subjects 320 thousand-word samples (111 from Plato, or a bit less than one-fifth of Plato, p. 38) to various tests of multivariate analysis, of which he gives a survey (p. 43). His methods are all variants of or alternatives to cluster analysis (he tried three methods thoroughly and eight more partially). The method of cluster analysis is essentially to compute for each pair of samples the sum of the squares of the differences in corresponding variables. (If there were only two variables a and b, instead of 37, this would give $d^2 = (a_1 - a_2)^2 + (b_1 - b_2)^2$, the square of the distance (d) between samples 1 and 2 in the plane of the variables a, b: in fact this calculation is referred to as the “Euclidean Distance.”) Then these distances are ‘clustered’ by successively adding those samples to a group whose distance from the group average is minimum.[8] While his explanation of the statistical method (pp. 26-38) is quite good and clear though a bit superficial, he here draws an invalid conclusion, which tends to weaken his work: because all the methods he tried gave similar results he argues that the results are the more certain, but the methods he lists have been shown (p. 33) to be mathematically more or less equivalent, so the coincidence of results proves nothing. Now the clusters (or groups) formed of these samples are formed on the basis of the mathematical similarity of the set of letter-counts, and the working assumption is that these counts and their similarities are in some way measuring style with a view to establishing chronology and authorship. Therefore I would be very concerned if samples a priori presumed inhomogeneous are clustered together. For instance, in Ledger’s example Xen. Oec. (all samples) clusters with “a stray one from [Plato’s] Pol.,” four (of ten) Thucy. samples (from Books 3-5, but which?) cluster with one from Aeschines, and Lysias (i.e., adv. Erat., the only oration used) clusters with “a stray from Thucydides” (p. 47): thus of ten clusters, three are contaminated. The conclusion should be that letter-counting alone does not give a sufficiently sensitive measure even of gross divergence in author and genre (he apparently used all 37 variables for this test: p. 47).
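The clustering procedure just described can be illustrated with a short Python sketch: squared Euclidean distance between vectors of counts, followed by repeated merging of the two groups whose averages (centroids) lie closest. This is a simplified centroid-linkage scheme written for illustration, not Ward's criterion and not necessarily any of the procedures Ledger ran; all names and the toy data are hypothetical.

```python
def sq_dist(x, y):
    """Squared Euclidean distance: sum of squared differences per variable."""
    return sum((a - b) ** 2 for a, b in zip(x, y))


def centroid(rows):
    """Group average, variable by variable."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def cluster(samples, k):
    """Merge the two groups with the nearest centroids until k groups remain."""
    groups = [[i] for i in range(len(samples))]  # start: one sample per group
    while len(groups) > k:
        best = None
        for i in range(len(groups)):
            ci = centroid([samples[m] for m in groups[i]])
            for j in range(i + 1, len(groups)):
                cj = centroid([samples[m] for m in groups[j]])
                d = sq_dist(ci, cj)
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        groups[i] = groups[i] + groups[j]
        del groups[j]
    return groups


# Toy data with two variables only; real samples would carry 37 counts each.
toy = [[0, 0], [0, 1], [10, 10], [10, 11]]
result = cluster(toy, 2)
```

With real data the question of interest is whether the resulting groups coincide with authors and works, which is exactly where the contaminated clusters noted above become visible.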
And it only gets worse. His next test succeeds only with Isaeus Orat. 3, while the other 11 works tried (Aeschines Tim., Isoc. Archid., De Pace, Paneg., Lys. Erat., Plato Apol., Pol., Rep. I, Thucydides 3, Xenophon Hell. [but which book?] and Oec.) are to some extent muddled up (pp. 51-5). His third test uses only the nine variables representing words ending with a particular letter (αεηιουω, ν and σ), but the same 111 samples (and 12 works) used for test #2 (pp. 55-8). But now he uses a technique which is roughly the inverse of cluster analysis: the samples are correctly grouped in advance and the mathematics is done to find a discriminant function (i.e., some combination of the nine variables) which gives distinct values when applied to distinct groups, but similar values when applied to samples within a group. Some 17 (i.e., roughly 15%) of the 111 samples are misclassified, and again the correct conclusion would be that it does not work reliably. Much better results are obtained by including ten of the variables representing words containing a specific letter (in particular αγδεηθικλμ; note four partial overlaps in the total of 19 variables): only one sample from Thucy. “goes astray” and lodges with Lysias (p. 58). Limiting himself to oratory (including Plato’s Apol.), he finds Isaeus Orat. 1 with Isocrates, and Aeschines (Timarchus) is mixed with various Isaean orations (pp. 66-8). The last is worst (pp. 68-70): Plato Apol., Prot., Rep. I, and Xen. Mem. and Oec. are tested, and “apart from … the Apology, there is no clear evidence of individuality in any of the other works” (using all 37 variables). My conclusion would be that Ledger has clearly proven that his simple letter-counting scheme does not work.
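The idea of a discriminant function can be shown in miniature. The Python sketch below computes Fisher's linear discriminant for two pre-assigned groups and two variables, then counts how many samples fall on the wrong side. It is an illustration under simplifying assumptions (two groups, two variables, a non-singular pooled scatter matrix), not Ledger's multi-group analysis; the names and the toy data are hypothetical.

```python
def mean(rows):
    return [sum(col) / len(rows) for col in zip(*rows)]


def scatter(rows, m):
    """Entries (sxx, sxy, syy) of the 2x2 scatter matrix about the mean m."""
    sxx = sum((r[0] - m[0]) ** 2 for r in rows)
    sxy = sum((r[0] - m[0]) * (r[1] - m[1]) for r in rows)
    syy = sum((r[1] - m[1]) ** 2 for r in rows)
    return sxx, sxy, syy


def fisher(g1, g2):
    """Return weights w = S^-1 (m1 - m2) and the two group means."""
    m1, m2 = mean(g1), mean(g2)
    a1, b1, c1 = scatter(g1, m1)
    a2, b2, c2 = scatter(g2, m2)
    a, b, c = a1 + a2, b1 + b2, c1 + c2      # pooled within-group scatter
    det = a * c - b * b                      # assumed non-zero
    dx, dy = m1[0] - m2[0], m1[1] - m2[1]
    w = [(c * dx - b * dy) / det, (-b * dx + a * dy) / det]
    return w, m1, m2


def classify(x, w, m1, m2):
    """Assign x to group 1 or 2 by the sign of the discriminant score."""
    mid = [(p + q) / 2 for p, q in zip(m1, m2)]
    score = sum(wi * (xi - mi) for wi, xi, mi in zip(w, x, mid))
    return 1 if score > 0 else 2


g1 = [[1, 2], [2, 1], [2, 3], [3, 2]]
g2 = [[6, 7], [7, 6], [7, 8], [8, 7]]
w, m1, m2 = fisher(g1, g2)
wrong = sum(classify(x, w, m1, m2) != 1 for x in g1)
wrong += sum(classify(x, w, m1, m2) != 2 for x in g2)
```

The quantity `wrong` is the analogue of the 17 misclassified samples out of 111 reported above; in this well-separated toy case it is zero.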
When he focuses on Plato, he sets himself five goals: (1) to establish the probability of the authenticity of the Epistles (i.e., #2, 3, 7, 8, and 13 which only are sufficiently long: p. 78), (2) similarly, dialogues Alc. I, Epin., Hipp. Ma., (3) similarly, dialogues Alc. II, Amat., Clitophon, Hipparch., Hipp. Mi., Ion, Menex., Minos, Theages, (4) the general relation of the dialogues, and (5) to establish an approximate chronology for all the dialogues (p. 75). I shall concentrate on #5. He sets out very clearly what he takes to be the external evidence and cross-references for dating: pp. 82-9. He will use this rough framework to guide his subsequent work, but the only feature explicitly included is that Laws is after Rep. (p. 88, n. 74).
Although it is not my purpose here to examine questions of authenticity, Ledger’s lengthy chapter thereon (pp. 92-169) is important, for he finds it difficult to discriminate by his method between Thucydides and Xenophon, or Plato and Xenophon (p. 93). Moreover Xen. Hell. is sufficiently different from Xen. Mem. + Oec. that the two sets appear to be different authors (p. 96). He reassures himself (but not me) that for “an entire work … we need only to accept a simple majority verdict” of its samples (p. 99). He attempts to improve his discriminant function (cp. above) by selecting the subset of the 37 variables which gives the best discrimination between Plato and Xenophon (he ought to have told us which variables these were), but this only throws the rest into a worse muddle (p. 115). Or again, Xen. Mem. is more like Plato than it is like Xen. Oec.—due to genre (p. 160). He more or less admits that his test cannot fathom authorship (p. 161), and doubts whether it can “cope with” genre (the Menexenus is an epitaphios and very unlike Plato by Ledger’s measures: p. 163). Even the Parm. is most unplatonic by Ledger’s measures and is “the one dialogue which would have to be rejected if it were necessary to eliminate one which was most untypical [of all attributed to Plato]” (p. 164).
After such “an exceedingly difficult chapter” (p. 168), it is at least impressive that Ledger pushes on. He wisely points out that the usual assumption of linear change in a stylistic feature is unverifiable, and the change might be for example a parabola (pp. 173-5)—an important point which few stylometricians seem to have considered. But he almost immediately abandons the point (p. 176): “we may be fairly sure that proximity of works indicates proximity in date.” He then selects (pp. 178-9) four sets of dialogues of whose chronological order he is relatively sure (e.g., “A” = Rep. before Laws, “D” = Gorg. and Prot. before Pol. and Soph.), but he attaches more importance to sets “B” (Gorg., Phdo., Prot., Symp. before Phil., Pol., Soph., Tim.) and “C” (Apol., Charm., Crito, Euthd., Gorg., Laches, Lysis, Menexenus, Phdo., Prot., Rep., and Symp. before Laws, Phil., Pol., and Soph.). On these he uses the method of finding the first canonical variable (that combination of the original, here 37, variables which is most correlated with the variation observed). Since sets A-D have been used as described, this canonical variable will be well correlated with the presumed chronology—whether or not the presumed chronology is correct (as he notes p. 179). The process (except with set A) tends to select the variables counting final letters (thus B selects ινω final as the three most important contributions to the first canonical variable, C selects ηινω final as the four most important, and D selects ν and ω final as two of the three most important: pp. 180-2). He also notes correctly that the value of the first canonical variable for each dialogue has a rather large uncertainty, which makes discrimination very iffy (pp. 184-5).
Yet when he announces results, he emphasizes the consistency of his results, not their uncertainty: even error may be consistent. The list of dialogues which his ‘results’ tell him are late is composed mainly of dialogues which he assumed were late: Phil. (late in lists B, C), Soph. (late in B, C, and D), Pol. (likewise), Laws (late in A and C), Tim. (late in B), plus Clitophon, Epp. 3, 7, 8, Epin., and Crit. He notes that the late group is “self-contained” (p. 187) because the numerical values assigned by the procedure are well separated from the rest, but he has failed to tell us a very important and very elementary fact—what is the statistical uncertainty (symbolised σ) of the comparison?[9] From his inadequate discussion (p. 185) we might conclude that the σ of each value was 1 or so;[10] if so, the Menexenus is not statistically different from the ‘late’ set. For example, set A, using 10 variables, gives Phil. 1.4887 (an illusory and misleading precision), and Menex. 0.7483 (Table 9.5, p. 188), so the difference is 0.74, which when divided by the correct σ for this comparison (1.4) gives the z-score, here about 0.5, which is hardly visible, let alone significant[11]. This analysis is very rough, because Ledger has not given the necessary σ’s. That is a damaging omission. In general, the Menex. often falls close to or within the ‘late’ group (for all of set A, set B with 6 variables or more, all of set C, all but one of set D); similarly the Parm. 2 (the “Eleatic part”, pp. 165, 212) is sometimes near or within the ‘late’ group (usually with fewer variables used).
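The rough significance check above can be reproduced in a few lines of Python. The sketch assumes, as the text does, that each value carries an uncertainty σ of about 1 (Ledger does not supply the real figures) and that the two values are independent, so that the σ of their difference is the square root of the sum of the squares. The function names are hypothetical.

```python
import math


def z_score(x1, x2, sigma1, sigma2):
    """Difference of two values divided by the uncertainty of the difference."""
    return abs(x1 - x2) / math.sqrt(sigma1 ** 2 + sigma2 ** 2)


def two_sided_confidence(z):
    """Probability that a normal variate lies within z standard deviations."""
    return math.erf(z / math.sqrt(2))


# Set A, 10 variables (Table 9.5, p. 188): Phil. 1.4887, Menex. 0.7483.
# sigma = 1.0 for each value is the rough guess made in the text.
z = z_score(1.4887, 0.7483, 1.0, 1.0)
confidence = two_sided_confidence(z)
```

Under these assumptions z comes out a little above 0.5 and the corresponding confidence at roughly 40%, far below any conventional threshold of significance.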
Despite this uncertainty, Ledger is willing to announce such details as (pp. 197-209) that Tim. and Crit. follow Laws, “Crit. being cut short by Plato’s death” (p. 197: it may be so for all I know—Ledger’s results surely do not prove it)[12]. Ledger expresses wariness about the position of Menex., but not on statistical evidence, rather on arguments from genre (pp. 210-2). For the Parm. he makes similar excuses due to subject (pp. 212-3), and revealingly remarks “I suspect that any other dialogue of comparable length and belonging to this or an earlier period, if split into two sections, would show only a marginally improved variation” (p. 213)—but he has not checked this crucial point. Although he notes that the results for the middle group are so close together “that random fluctuations are liable to cause considerable variation” (p. 213), he is then willing to follow the numbers precisely (p. 217) and to conclude that the order is Gorg., Meno, Charm., Apol., Phdo., Laches, Prot., Euthd., Symp. (pp. 217, 223-4).[13] Of course, it turns out that even the mathematical order varies considerably, and for the early group Ledger gives up and visually assesses the data (p. 219) to obtain Lysis, Euph., Minos, Hipp. Mi., Ion, Hipp. Ma., Alc I, Theages, Crito (which involves him in various apologiae with respect to the Apol.: pp. 221-2); of the middle group he only says (p. 223) “it represents the aggregate of all the evidence the stylometric analysis has to offer.”
The conclusion must be that despite a noble attempt, the work is carved from rotten stone and crumbles at the touch. I would not necessarily conclude that letter-counting itself is to be abandoned, but Ledger’s results (when correctly analysed) are useful to show that it seems rather fuzzy. Perhaps some variation of the method could work, and perhaps some subset of Ledger’s data is useful for chronology (or even authorship). But this could only be known if the more fundamental question of methodology were first attacked—what are the tests of stylometry telling us and with what uncertainty? Quomodo ipsae probationes probari possunt?
[1] George Udny Yule, The Statistical Study of Literary Vocabulary (Cambridge 1944) 7-8. Greeks and Latins assigned numerical values to letters: P. Perdrizet, “Isopséphie,” RÉG 17 (1904) 350-60.
[2] See Eric G. Turner, “P. Bodmer XXVIII: A Satyr-Play on the Confrontation of Herakles and Atlas,” MH 33 (1976) 1-23; a reference I owe to Wm. M. Calder III.
[3] Claude E. Shannon, “A Mathematical Theory of Communication,” Bell System Technical J 27 (1948) 379-423, 623-656.
[4] C. E. Shannon, “Prediction and Entropy of Printed English,” Bell System Technical J 30 (1951) 50-64; N. G. Burton and J. C. R. Licklider, “Long-range constraints in the statistical structure of printed English,” AJPsych 68 (1955) 650-3 (letter-frequencies are correlated up to a distance of 32 letters apart in text, not longer); Gustav Herdan, “An Inequality between Yule’s Characteristic K and Shannon’s Entropy H,” Zeitschr f Angewandte Math u Phys 9 (1958) 69-73 (connecting letter-frequency and word-distribution); and Mario Grignetti, “A Note on the Entropy of Words in Printed English,” Information and Control 7 (1964) 304-6 are a sample of the first twenty years.
[5] Suzanne Govaerts, “Les Initiales de Vers chez Lucrèce et Virgile,” Statistique et Analyse Linguistique, edd. Ch. Müller and B. Pottier (Paris 1966) 41-9—Aen. later than Georg.
[6] Wm. Ralph Bennett, Jr., Scientific and Engineering Problem-Solving with the Computer (Englewood Cliffs, NJ 1976) 111-46, a reference I owe to Hugh G. Robinson (Duke, Physics) who introduced me to this book and problem in the very pleasant summer of 1978.
[7] Alan S. C. Ross, “Philological Probability Problems,” J Roy Stat Soc (B) 12 (1950) 19-59 @ 19-30.
[8] J. H. Ward, “Hierarchical grouping to optimise an objective function,” J Amer Stat Assoc 58 (1963) 236-44; see also any of various textbooks, esp. Brian Everitt, Cluster Analysis (London 1974), and R. Sokal and P. H. A. Sneath, Principles of Numerical Taxonomy (San Francisco 1963), 2nd ed.: Numerical Taxonomy: The Principles and the Practice (San Francisco 1973).
[9] For the definition and explanation of ‘uncertainty’ see any of numerous statistics books; e.g.: Anthony John Patrick Kenny, The Computation of Style (Oxford 1982); Murray R. Spiegel, Theory and Problems of Statistics (New York 1961) in the Schaum’s Outline Series; or John Robert Taylor, An Introduction to Error Analysis (Mill Valley CA 1982). A simple example case will help: in computing the average of a set of values (say scores on an exam), there is an associated uncertainty (roughly half the width of the ‘bell-curve’), called σ, computed: $\sigma = \sqrt{\sum_i (X_i - A)^2 / N}$, where Σ is the summation sign, the X’s are the N values, and A is the average.
[10] He gives only two examples, for a test involving 9 variables, of 1.29 and 0.60, whose average is 0.945.
[11] See any of the books listed above (e.g., Spiegel 167-71) for an explanation of this z-score and of the method of computing the significance of differences between two values each having an uncertainty. The z-score is usually converted to a probability of significance (see Spiegel 343, Taylor 245, or Kenny 171 for tables of percentage points for z-scores). The z-score of 0.5 gives a probability of 38%, but it is important to note that the usual statistical procedure is to regard anything less than 95% (z = 2.0) or even 99.7% (z = 3.0) as not significant. In any event, 38% is negligible.
[12] He tries to confirm the order Crit. after Tim. by noting his results say so “in about 90% of the lists” (p. 197)—in fact the order is reversed 8 times, out of 72 lists, but the lists are scarcely independent since, e.g., the first two use the first 10 and the first 9 canonical variables derived from set A.
[13] He does also insert Menex. after Gorg. (“somewhat arbitrarily” p. 212) and remove Crito from its mathematically-assigned place roughly after Gorg. to before Gorg.