Bryn Mawr Classical Review 04.02.19


Keyser Responds Concerning Frischer's Work


I would like to take this opportunity to respond to the seven substantive points raised by the group of four scholars Clayman, Crane, Guthrie, with the assistance of Frischer, in BMCR 4.1 (1993) 75-81 in criticising my printed review at BMCR 3.2 (1992) 118-22. Before doing so, I must make three important prefatory remarks. First, to judge from a letter from one of them and the tone of their defence of F., it would seem that some phrases in my review were taken personally. Let me hasten to assure the group (and any others who may have been similarly misled) that I have never at any time borne any ill-will toward F. (or stylometry per se), and deeply regret any such impression I may have conveyed. Second, the group (pp. 76, 81) suggests that the ideal reviewer would have been competent both in stylometry and in Horatian studies. A noble principle, but something of a reductio, as for all but the most general works the most competent (and so chosen) reviewer would then be the author. Third, the group suggests (p. 77) that the "most effective" procedure would have been for me to propose an alternate date on stylometric grounds. Again a reductio, as normally a reviewer will not rewrite the reviewed book. What is done (and was done) is to point out how it could or should have been done. And since my fundamental methodological point is that for neither of F.'s two main positive stylometric attempts (rare "strings" and frequent "function words") is there sufficient evidence from control studies that they are effective at all, I naturally saw no point in performing an improved version of either.

The first and longest substantive point made by the four scholars (pp. 76-7 and 77-8) is in essence that I criticise certain of F.'s stylometric arguments out of their context, that context being the remainder of F.'s book. My specific criticism (p. 122) was as follows. F. (pp. 47-8, 155-8) varies the date of the Ars "from its earliest to its latest possible year of composition (i.e., 24 to 8 B.C.)," and for each of these 17 years fits to a parabola the rate of use of the four frequent words ad, nec, per, and sed in all six hexameter poems. The best fit is chosen in each case and compared to the worst fit in that case. My point was and is simply that such a procedure ought not to produce absurd results and that dates earlier than 24 B.C. ought to have been checked. Note that in Tables Q-T (pp. 155-8) for these words one can validly eyeball the fact that dates later than ca. 15 B.C. will give increasingly poor fits, as will dates earlier than ca. 50 B.C. This confirms the absence of any absurd results (some impossible results are not excluded by this simple procedure however). Further, the fact that the best dates for all but ad are the earliest allowed date suggests that the data may in fact be represented best by even earlier dates and that the agreement of these three on 24 B.C. is spurious.

I reiterate that the fits are in any case not very good, as again can be validly determined by inspection of the graphs (F. pp. 155-8). Taking the dates used by F. (p. 44) and the rates of use of per and sed (p. 33) I have repeated his regression analysis for those two words. For sed the worst fit was 11 B.C. (r2 = 0.205) not F.'s reported 8 B.C. (for which I found r2 = 0.247), while the best fits were the four years 25-6 B.C. and 32-3 B.C. (r2 = 0.821), the latter just as I eyeballed it (p. 122, n. 1). For per the worst fit was the years 8-9 B.C. (r2 = 0.210) while the best fits were for 21 and 36 B.C. (r2 = 0.641 -- my eyeball value of 37 B.C. has r2 = 0.639). In both cases the disjoint pairs of best fits are to be expected for a parabolic trend line; also to be noted is that the r2 does not greatly decrease for intervening years (for sed r2 over 35 to 23 B.C. is greater than 0.814, cp. the peak values of 0.821, and for per r2 over 43 to 18 B.C. is greater than 0.594, cp. the peak values of 0.641). Now r2 > 0.81 for 6 points (sed) is a pretty good fit, but r2 < 0.64 (the other three words' best fits) is unconvincing (below).
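The scan over trial dates described above can be sketched in a few lines of Python. This is an illustration only, under stated assumptions: the function names (quadratic_r2, scan_dates) are hypothetical, no data from F.'s tables are reproduced, and the routine simply fits a least-squares parabola to a set of fixed points plus one movable point and reports r2 for each trial date of the movable point.

```python
def _det3(m):
    """Determinant of a 3x3 matrix given as nested lists."""
    return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
            - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
            + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))


def quadratic_r2(xs, ys):
    """Fit y = a + b*u + c*u**2 by least squares (u = x - mean of x) and return r^2."""
    n = len(xs)
    mean_x = sum(xs) / n
    us = [x - mean_x for x in xs]  # centring keeps the normal equations well conditioned
    s = [sum(u ** k for u in us) for k in range(5)]                   # sums of u^k
    t = [sum(y * u ** k for u, y in zip(us, ys)) for k in range(3)]   # sums of y*u^k
    normal = [[s[0], s[1], s[2]], [s[1], s[2], s[3]], [s[2], s[3], s[4]]]
    d = _det3(normal)
    coeffs = []
    for j in range(3):  # Cramer's rule, one coefficient at a time
        mj = [row[:] for row in normal]
        for i in range(3):
            mj[i][j] = t[i]
        coeffs.append(_det3(mj) / d)
    a, b, c = coeffs
    mean_y = sum(ys) / n
    ss_res = sum((y - (a + b * u + c * u * u)) ** 2 for u, y in zip(us, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1.0 - ss_res / ss_tot


def scan_dates(fixed_points, movable_rate, trial_years):
    """Return {trial year: r^2} when the movable poem is placed at each trial year."""
    xs = [x for x, _ in fixed_points]
    ys = [y for _, y in fixed_points]
    return {year: quadratic_r2(xs + [year], ys + [movable_rate])
            for year in trial_years}


# Hypothetical points lying exactly on a parabola with its vertex at x = 10.
demo_points = [(x, (x - 10) ** 2) for x in (0, 4, 8, 12, 16)]
demo_scan = scan_dates(demo_points, 16, range(0, 21))
```

In this made-up example the movable rate of 16 lies on the curve at both x = 6 and x = 14, so the scan returns two equally good trial dates on either side of the vertex, which is the kind of disjoint pair of best fits one expects from a parabolic trend line.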

Thus it is appropriate to examine earlier dates; no "absurd questions" are being asked; and F.'s attempt to use words of high frequency as chronological markers does not succeed in narrowing the range of dates. F. might have argued as follows. If such word frequencies mean anything (which itself has never been demonstrated in Greek or Latin), then the rate of sed at least is more consistent with an earlier date (ca. 41 to 20 B.C.). Such an argument would bear the weight of its conclusion (below).

Four further briefer points raised by the four scholars (pp. 78-9) also concern F.'s attempts to use the frequent words (ad, nec, per, and sed). It seems that my suggestion of a chi-squared test involving conventional dates of the poems and rates of use was misunderstood due to my brevity (BMCR 3.2 [1992] 122, vs. 4.1 [1993] 78). To clarify: first, take the conventional dates of all but the Ars (F. p. 44) in (say) six five-year wide "bins" as the independent variable; second, take the six average rates of use (with adjustments for genre?) for all poems in a given bin as the six observed values of the dependent variable; and (the point I apparently failed to make sufficiently explicit) using some model (presumably including the effect of genre) generate a set of six predicted values to which the observed values would be compared, using the chi-squared test. A perfectly ordinary and proper use, in all the books (e.g., J. R. Taylor, Introduction to Error Analysis [San Francisco 1982] 224-6) ever since Karl Pearson developed the test (Phil Mag 50 [1900] 157-75).
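The comparison of observed with predicted values can be sketched as follows. This is a minimal illustration, not a reconstruction of any calculation in F.'s book: the function names are hypothetical, and the per-bin uncertainties (sigmas) are an assumption that any real application would have to estimate from the data.

```python
def chi_squared(observed, predicted, sigmas):
    """Sum over bins of ((observed - predicted) / sigma)**2."""
    if not (len(observed) == len(predicted) == len(sigmas)):
        raise ValueError("all three sequences must have one entry per bin")
    return sum(((o - p) / s) ** 2 for o, p, s in zip(observed, predicted, sigmas))


def reduced_chi_squared(observed, predicted, sigmas, n_fitted_params):
    """Chi-squared divided by the degrees of freedom (bins minus fitted parameters)."""
    dof = len(observed) - n_fitted_params
    if dof <= 0:
        raise ValueError("need more bins than fitted parameters")
    return chi_squared(observed, predicted, sigmas) / dof
```

With six bins and a model having, say, two fitted parameters, there are four degrees of freedom; a reduced value of order one or less indicates that the model is consistent with the observed rates, while a value much greater than one indicates that it is not.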

Second, as noted above, my claim that the goodness of fit does not greatly vary and that, for most of the cases considered by F., it is not ever very good, is in fact quite precise and accurate (BMCR 3.2 [1992] 122, vs. 4.1 [1993] 78-9). It is not true that the values of r2 (the square of the ordinary correlation coefficient) "for the middle period [are] from 130 - 330 % better than [those] for the late period." Rather, they are in ratios of 1.3 to 3.3, sc. "30 % to 230 % better", but in fact that ratio does not mean that. The r2 is a dimensionless number from which one derives the probability that the N pairs of data-points if in fact uncorrelated would show by chance a value of r2 as great (but no greater) than that obtained. For a three-parameter fit to six pairs of observeds the probability P of obtaining r2 < 0.64 is < 90 %, which is usually considered the limit below which significance is not to be claimed. For sed, the best r2's of 0.814-0.821 correspond to P's of about 96.5 %, or equivalently the 95 % confidence interval would be 41 to 20 B.C. (r2 > 0.77 or so), which could fairly be expressed as 30.5 ± 5.4 B.C.
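The step from r2 to P can be illustrated with a short sketch. The text does not state which test was used; the code assumes a two-sided Student t test on r with 6 - 3 = 3 degrees of freedom, an assumption of this sketch that happens to give figures close to those quoted. The function names are hypothetical, and the second helper assumes that the "±" figure is the half-width of the 95 % interval divided by 1.96.

```python
import math


def confidence_from_r2_three_dof(r2):
    """Two-sided Student-t probability content for r^2 with 3 degrees of freedom.

    Uses the closed form for nu = 3:
    P = (2/pi) * (theta + sin(theta) * cos(theta)), with tan(theta) = t / sqrt(3).
    """
    if not 0.0 <= r2 < 1.0:
        raise ValueError("r2 must lie in [0, 1)")
    nu = 3
    t = math.sqrt(nu * r2 / (1.0 - r2))
    theta = math.atan(t / math.sqrt(nu))
    return (2.0 / math.pi) * (theta + math.sin(theta) * math.cos(theta))


def interval_to_one_sigma(lower, upper, z95=1.96):
    """Convert a 95 % interval to (centre, one-sigma) assuming a Gaussian shape."""
    centre = (lower + upper) / 2.0
    sigma = (upper - lower) / 2.0 / z95
    return centre, sigma
```

On these assumptions r2 = 0.64 gives P just under 0.90, r2 = 0.821 gives P of about 0.966, and r2 = 0.77 gives P of about 0.95; the interval 41 to 20 B.C. converts to a centre of 30.5 with a one-sigma width of about 5.4 years.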

The third point is my criticism of F.'s method of establishing that for four of sixteen very frequent words the three-point trend lines of rate of use in the three books of lyrics can be said to match the pattern of the corresponding five-point trend lines in the five hexameter poems (BMCR 3.2 [1992] 121-2, vs. 4.1 [1993] 79). To eyeball something of this sort as an initial guideline is quite useful and appropriate, but that visual inspection and analysis must be followed by something more convincing. It may have been reckless of me to go so far as to call F.'s lack of any more rigorous method "mad," but I note that neither F. in the book nor the four scholars (including F.) ever explain, even in a short summary paragraph, what that more rigorous method was. Worse than reckless, my criticism, if faced with rigor, would simply be wrong -- yet nowhere do they ever tell us in what way more rigorous than visual inspection the matchings were established. Until this is done, I do not see that there is any compelling reason to accept the matches.

The fourth point is my criticism of F.'s use of frequent words without adequate preliminary control experiments. Indeed "the concept has many ... potential uses" (p. 79) but F. has not shown that that potential can here be actualised. Until that is done an r2 of even 0.99 would remain unsupported, possibly spurious, and hence unconvincing.

The last two criticisms (pp. 79-80) concern my views of F.'s attempts to use rare "strings" (review pp. 119-20). I am sorry that my attempt to use a more precise and correlative terminology ("grapheme -- lexeme") for the usual awkward jangle ("string -- lexeme") confused Clayman et al. (no other readers known to me had any difficulty). I explicitly defined precisely what was meant (which is just the same as F. meant) and the formation is perfectly regular: τὸ γράφημα = 'that which is written'. In any case, even though it may be "perfectly normal" in stylometry to count "strings", my argument that it may be invalid stands -- in a highly inflected language counting grapheme-words instead of lexeme-words muddles vocabulary and grammar.

Moreover, my criticism of F.'s numerical result here does not involve claiming that there is not sufficient evidence of a non-zero slope in the rate of "unique strings" (pace Clayman et al.). As I clearly stated, that F. has probably established. My criticism, from the true understanding of which an electronically-generated misprint seems to have deflected all four scholars, was and is that F. calculates, from the line fitted to the rates of use of "unique strings" vs. conventional dates for the five hexameter poems, a date for the Ars (so far so good), but fails to include the necessary uncertainty on that value. That I provided, and when understood correctly, it shows that the derived date should be given as 20 ± 5 B.C., just as I did. The date is still not "highly reliable"; it is relatively uncertain as F.'s adjusted r2 indicates a marginal correlation (as above, P is ca. 87 %), and even so the 95 % confidence interval is a decade wide.
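The propagation of the slope uncertainty into the derived date can be sketched as follows. This is one reading of the procedure, not a statement of the exact calculation: it writes the fitted line in point-slope form through a chosen reference point and carries through only the uncertainty of the slope, ignoring any uncertainty in the target rate or in the reference point. The function name is hypothetical; the inputs in the usage note are the hexameter slope (0.92 ± 0.35 %/annum) and the reference points quoted elsewhere in this response.

```python
def date_from_line(y_target, x_ref, y_ref, slope, slope_sigma):
    """Solve y_target = y_ref + slope * (x - x_ref) for x, with its uncertainty.

    Only the slope uncertainty is propagated:
    sigma_x = |x - x_ref| * slope_sigma / |slope|.
    """
    dx = (y_target - y_ref) / slope
    sigma_x = abs(dx) * slope_sigma / abs(slope)
    return x_ref + dx, sigma_x
```

For example, date_from_line(66.1, 718.0, 52.8, 0.92, 0.35) places the Ars about 14.5 years after the reference point with an uncertainty of about 5.5 years, while date_from_line(66.5, 729.8, 62.06, 0.92, 0.35) places it about 4.8 years after the centroid with an uncertainty somewhat under 2 years. The uncertainty shrinks with the distance from the reference point, which is why the centroid gives the smallest value; rounding in the quoted inputs may account for small differences from the figures given in the text.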

There are however further problems and alternatives. First, when checking the fit for the three books of lyrics, I find not F.'s reported slope of 0.15 ± 0.01 %/annum, but the quite different 0.24 ± 0.02 %/annum (ca. 4 standard deviations, P > 99.9 %). Second, F. calculated the date of the Ars from his fitted line for hexameter poems using y = 66.1 % (instead of his own determined value of 66.5 %) and the fitted slope and intercept only; including the uncertainty due to the slope is necessary and gives the ± 5 quoted, using Sat. I (x, y = 718 AUC, 52.8 %) as the other point. One may instead use the centroid point, (mean x, mean y) = (729.8 AUC, 62.06 %), as the other point, but while this will produce the minimum possible uncertainty, it shifts the date: 18.4 ± 1.7 B.C. This is now formally inconsistent with the result 30.5 ± 5.4 B.C. obtained from sed (above), the deviation (z = 2.14) having a P of 96.8 % of being significant.
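The comparison of the two derived dates can be checked with a short sketch. It assumes that the two estimates are independent and that their errors are Gaussian, so that the uncertainties add in quadrature; the function name is hypothetical.

```python
import math


def discrepancy(a, sigma_a, b, sigma_b):
    """Return (z, P): the deviation in combined standard deviations and the
    two-sided Gaussian probability content within +/- z."""
    z = abs(a - b) / math.hypot(sigma_a, sigma_b)
    p = math.erf(z / math.sqrt(2.0))
    return z, p


z, p = discrepancy(30.5, 5.4, 18.4, 1.7)
```

With the two dates 30.5 ± 5.4 and 18.4 ± 1.7 this gives z of about 2.14 and P of about 0.97, in line with the figures quoted above.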

In conclusion I note that I too would have wished more mathematics to be present in my review(s), but the editors very properly removed a great deal as being too lengthy and inappropriate for a journal of this sort. I would be genuinely delighted to discuss further and with full mathematical apparatus any aspect of F.'s book, with any of the four scholars (or anyone else), chartaceas per litteras. Moreover there are positive results of F.'s work. He has surely demolished Duckworth; that is of value. If sed (with others?) is indeed a possible chronometer (which has not been established), then (as noted above) when suitably altered F. may have shown that based on the rate of use of sed in hexameter poems, the date of the Ars may be described as 30.5 ± 5.4 B.C. And thirdly, if it is valid to count "strings" (grapheme-words) instead of lexemes (which has also not been established), then (as noted above) when suitably altered, F. may have shown that based on a marginal fit to the rate of use of "unique strings" in hexameter poems, the date of the Ars may be described as 18.4 ± 1.7 B.C. (inconsistent at P = 96.8 % with the result from sed).

The four scholars encourage me to make a "case ... for an another date on stylometric grounds" (p. 77): I have done so above for F.'s two main results, and I now briefly consider one further possible alternate interpretation of F.'s data. The rate of increase in use of "unique strings" appears to be quite similar (by eyeball) in the two groups (a) the lyrics plus Epist. II and (b) the Satires plus Epist. I (most easily seen by plotting the points of Table XIV, p. 45 on Table XV, p. 46). In fact when fitted the slopes turn out to represent an increase in the rate of use of "unique strings" of respectively 0.24 ± 0.04 %/annum and 0.26 ± 0.08 %/annum (with r2 values of 0.908 and 0.920 respectively, giving P's of ca. 98 % with three degrees of freedom and 88 % with one degree, also respectively). But the overall rate is quite different between the two sets (offset = 13.8 ± 2.4 %). F. interprets these data as showing two quite different rates of increase (0.92 ± 0.35 %/annum from a moderately good fit for the hexameters, and 0.15 ± 0.01 to be corrected to 0.24 ± 0.02 %/annum for the lyrics), distinguished by metrical genre. But he has not shown that that is the correct interpretation, nor that others are excluded. On my alternate interpretation, the data cannot give a date for the Ars at all, as it may be in a third category with a different offset (if we force it into the first category, the calculated date with uncertainty, using the centroid point, is 41.8 ± 3.9 B.C.). This all suggests that the data may be revealing that rate of change in use of "unique strings" may be an authorial attribute independent of relative date and genre within corpus. That should be tested.

The final words here, like those of the Laws to Socrates in the Crito, keep ringing in my ears: until adequate control tests have been performed on a given stylometric test (whether rate of use of frequent words, or my suggestion above concerning rate of change in use of "unique strings"), there are no good grounds for trust in it. Shifting Paradigms is still shifting sands; but sand bonded with the cement of control tests may well make a strong mortar.