
Term Burstiness

Theme: Natural Language Processing
Terms are not distributed homogeneously in text: once a term has occurred, the likelihood of it occurring again is much higher. In other words, content terms tend to occur in bursts. Standard term distribution models try to capture this in a variety of ways, all of which draw on frequency as the only useful measure of a term's behaviour. However, unless terms are distributed homogeneously (which we know they are not), it follows that positional information should be captured too. We modeled the bursty behaviour of terms by measuring the gaps between successive occurrences of a term, describing those gaps with a mixture of exponential distributions, and fitting the mixture in a Bayesian framework. The model shows clearly that all terms (including function words) behave burstily, and that they behave differently in different collections. We are testing the hypothesis that fine-grained models such as ours are capable of determining the role of words in different contexts. Possible applications include style detection, authorship identification, and identifying when a term occurs as a topical content word, in order to improve precision in retrieval and filtering.
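The gap-based view described above can be illustrated with a minimal sketch. It is not the project's implementation. The function names `gaps` and `fit_exp_mixture` are hypothetical, the mixture is assumed to have two components (one for short within-burst gaps, one for long between-burst gaps), and the fit uses plain maximum-likelihood EM as a simple stand-in for the Bayesian fitting mentioned in the text. Gaps are integer token distances, so the continuous exponential is only an approximation here.

```python
import math

def gaps(tokens, term):
    """Distances (in tokens) between successive occurrences of `term`."""
    positions = [i for i, tok in enumerate(tokens) if tok == term]
    return [b - a for a, b in zip(positions, positions[1:])]

def fit_exp_mixture(xs, iters=200):
    """EM for an assumed two-component exponential mixture.

    Returns (w, r1, r2): the weight of the short-gap component and the
    two rates. This is a maximum-likelihood stand-in, not a Bayesian fit.
    """
    mean = sum(xs) / len(xs)
    # Crude starting point: one fast (short-gap) and one slow (long-gap) component.
    w, r1, r2 = 0.5, 2.0 / mean, 0.5 / mean
    for _ in range(iters):
        # E-step: probability that each gap came from the short-gap component.
        resp = []
        for x in xs:
            p1 = w * r1 * math.exp(-r1 * x)
            p2 = (1 - w) * r2 * math.exp(-r2 * x)
            resp.append(p1 / (p1 + p2))
        # M-step: re-estimate the weight and the two rates.
        n1 = sum(resp)
        n2 = len(xs) - n1
        w = n1 / len(xs)
        r1 = n1 / sum(r * x for r, x in zip(resp, xs))
        r2 = n2 / sum((1 - r) * x for r, x in zip(resp, xs))
    return w, r1, r2
```

Under these assumptions, a strongly bursty term would show a large gap between the two fitted rates, while a term that is spread evenly would be described almost as well by a single exponential. The sketch has no guards for degenerate input such as an empty gap list.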