Notes

1. Provided all expected values are over a threshold of 5. Where there is just one degree of freedom, Yates' correction is applied.
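
As an illustration of the conditions in this note, the following is a minimal sketch of a chi-squared computation for a 2x2 table (one degree of freedom) with Yates' correction. The function name, the argument layout (counts of a word and of all other words in two corpora) and the choice to raise an error when an expected value is 5 or less are assumptions made for the example, not a description of the code used in the experiments.

```python
def chi_squared_2x2(a, b, c, d, yates=True):
    """Chi-squared statistic for a 2x2 contingency table.

    Hypothetical layout, chosen for this example:
      a = occurrences of the word in corpus 1
      b = occurrences of the word in corpus 2
      c = occurrences of all other words in corpus 1
      d = occurrences of all other words in corpus 2
    """
    n = a + b + c + d
    row1, row2 = a + b, c + d
    col1, col2 = a + c, b + d

    # Expected counts under the hypothesis that both corpora share one distribution.
    observed = [a, b, c, d]
    expected = [row1 * col1 / n, row1 * col2 / n,
                row2 * col1 / n, row2 * col2 / n]

    # The test is only used when every expected value is over 5.
    if min(expected) <= 5:
        raise ValueError("expected values must all exceed 5")

    # Yates' correction subtracts 0.5 from each |O - E| before squaring.
    # Clamping at zero is a common safeguard and an assumption of this sketch.
    correction = 0.5 if yates else 0.0
    return sum(max(abs(o - e) - correction, 0.0) ** 2 / e
               for o, e in zip(observed, expected))
```

For example, `chi_squared_2x2(10, 20, 90, 80)` applies the correction, while passing `yates=False` gives the uncorrected statistic for the same table.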

2. Strictly, word-form-and-part-of-speech lists; the BNC is part-of-speech tagged, and part-of-speech distinctions were retained in the lists. No lemmatisation was carried out. All experiments reported in this paper were performed on such <wordform, POS> lists.

3. The log-likelihood statistic (Dunning, 1993) would have the same advantages, and is, mathematically, a more appropriate test. We shall consider using it for future experiments. It has not been used in the current trials because it is more complex to compute and, where expected values for word frequencies are over 5 and the probability of the next word being the word of interest is less than 1 in 50, the difference between chi-squared and log-likelihood is very small. These two conditions hold for all the data that we are using, except the data for the four words "the", "of", "and" and "a". For a survey of statistical approaches, see Kilgarriff (1996).
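
The closeness of the two statistics under these conditions can be checked with a short sketch such as the one below. It computes the uncorrected Pearson chi-squared and the log-likelihood statistic G2 = 2 * sum(O * ln(O / E)) for the same 2x2 table. The function names and the table layout are assumptions made for the example, and the counts in the usage note are invented for illustration rather than taken from the BNC.

```python
import math


def expected_counts(a, b, c, d):
    """Expected cell counts for a 2x2 table under independence.

    Hypothetical layout: a, b = counts of the word in corpus 1 and 2;
    c, d = counts of all other words in corpus 1 and 2.
    """
    n = a + b + c + d
    row1, row2 = a + b, c + d
    col1, col2 = a + c, b + d
    return [row1 * col1 / n, row1 * col2 / n,
            row2 * col1 / n, row2 * col2 / n]


def pearson_chi2(a, b, c, d):
    """Uncorrected Pearson chi-squared: sum of (O - E)^2 / E."""
    observed = [a, b, c, d]
    return sum((o - e) ** 2 / e
               for o, e in zip(observed, expected_counts(a, b, c, d)))


def log_likelihood_g2(a, b, c, d):
    """Log-likelihood statistic: 2 * sum of O * ln(O / E).

    Cells with an observed count of zero contribute nothing to the sum.
    """
    observed = [a, b, c, d]
    return 2.0 * sum(o * math.log(o / e)
                     for o, e in zip(observed, expected_counts(a, b, c, d))
                     if o > 0)
```

With invented counts of 1000 and 1100 occurrences in two corpora of 100,000 words each, `pearson_chi2(1000, 1100, 99000, 98900)` and `log_likelihood_g2(1000, 1100, 99000, 98900)` can be compared directly; the word's probability is about 1 in 100 and all expected values are well over 5, so the two conditions of this note are met.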


References

Dunning, T. 1993. "Accurate methods for the statistics of surprise and coincidence." Computational Linguistics. 19(1). Pp 61--74.

Hofland, K. and Johansson, S. 1989. Frequency analysis of English vocabulary and grammar, based on the LOB corpus. Oxford: Clarendon.

Kilgarriff, A. 1996. "Which words are particularly characteristic of a text? A survey of statistical approaches." Proceedings, ALLC-ACH '96. Bergen, Norway.