Does Sample Size Matter In Markov Chain Text Generators?
Matt Jain
Sir Winston Churchill Secondary
Floor Location: M 101 F

Markov chains are a well-known statistical model that can be used to predict the next state of a system based on its current state. Applied to language, they can generate sentences based on a sample text. This project explores how changing the word count of the sample used by a Markov chain text generator affects the accuracy of its output, as measured by word frequency. Word frequencies are one quantitative way of measuring language, as demonstrated by Zipf's law. It was hypothesized that increasing the word count of the sample would result in a closer match between the word frequencies of the sample and those of the generated output. To test this hypothesis, samples of 1000, 2000, 10 000, 50 000, 100 000, 500 000, 1 000 000, and 2 000 000 words were compiled from the British National Corpus. 5000 sentences were then generated from each sample, and the frequencies of the 10 most common English words (according to wordcount.org) were calculated for both the sample and the output. The differences between sample and output frequencies for each word were averaged; a smaller average difference indicated higher accuracy. The results showed that a larger sample size generally led to more accurate output: as the sample grew, the word frequencies of the sample and the output differed less. Based on these results, it was concluded that the hypothesis posed at the beginning of the experiment was supported.
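The procedure described above can be sketched in Python. This is a minimal illustration, not the project's actual code: it builds a first-order word-level Markov chain from a sample, generates text by walking the chain, and compares the relative frequencies of a few target words between sample and output. The tiny sample text and target words here are placeholders standing in for the British National Corpus samples and the 10 most common English words.

```python
import random
from collections import Counter, defaultdict

def build_chain(words):
    # Map each word to the list of words observed to follow it in the sample.
    chain = defaultdict(list)
    for current, following in zip(words, words[1:]):
        chain[current].append(following)
    return chain

def generate(chain, start, length):
    # Walk the chain, picking each next word uniformly at random from the
    # successors observed in the sample (duplicates in the list encode
    # the transition probabilities).
    out = [start]
    for _ in range(length - 1):
        successors = chain.get(out[-1])
        if not successors:
            break
        out.append(random.choice(successors))
    return out

def frequencies(words, targets):
    # Relative frequency of each target word within a list of words.
    counts = Counter(words)
    total = len(words)
    return {w: counts[w] / total for w in targets}

# Toy demonstration (the real experiment used BNC samples of up to 2 000 000 words).
sample = "the cat sat on the mat and the dog sat on the rug".split()
chain = build_chain(sample)
generated = generate(chain, "the", 50)

targets = ["the", "on", "sat"]
sample_freq = frequencies(sample, targets)
output_freq = frequencies(generated, targets)

# Average absolute difference between sample and output frequencies:
# a smaller value indicates the output matches the sample more closely.
avg_diff = sum(abs(sample_freq[w] - output_freq[w]) for w in targets) / len(targets)
```

A larger sample would populate the chain with more transitions per word, which is the mechanism by which the generated output's word frequencies were expected to converge toward the sample's.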