It is common knowledge that in NLP and IR there is a dependency between a task, the performance of a technique, and the properties of the data, or text, on which the technique is deployed. Yet the relationship between data and technique performance has never been explored systematically, in spite of the fact that this gap causes methodological problems.
For instance, many experimental results prove not to be replicable on datasets other than the ones on which they were obtained. We have started exploring some of these dependencies. Our aim is to find, and verify, measures capable of revealing biases in language data that make it suitable (or unsuitable) for treatment with a given technique. We started with some naive sparseness measures to show that significant differences exist between languages, differences that are likely to affect statistically based techniques. We then showed that within the same language, even within standard reference collections, some straightforward sparseness and homogeneity measures reveal differences between genres and document types that are likely to affect the applicability of techniques.
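The text does not specify which sparseness measures were used; as an illustrative sketch, two common naive choices are the type-token ratio and the proportion of hapax legomena (terms occurring exactly once). The function below is a hypothetical example of how such measures could be computed for a tokenised collection, not the project's actual implementation.

```python
from collections import Counter

def sparseness_measures(tokens):
    """Naive sparseness statistics for a tokenised collection:
    the type-token ratio (distinct terms / total terms) and the
    hapax ratio (share of terms occurring exactly once)."""
    counts = Counter(tokens)
    n_tokens = len(tokens)
    n_types = len(counts)
    hapaxes = sum(1 for c in counts.values() if c == 1)
    return {
        "type_token_ratio": n_types / n_tokens,
        "hapax_ratio": hapaxes / n_types,
    }

# Two toy "collections" with different repetition profiles:
# repetitive newswire-like text vs. vocabulary-heavy forum slang.
newswire = "the court said the ruling said the court would stand".split()
forum = "lol gr8 thx cya imho tbh afaik brb".split()

print(sparseness_measures(newswire))  # low ratios: much repetition
print(sparseness_measures(forum))     # high ratios: very sparse
```

A collection where most terms are hapaxes gives statistical techniques little evidence per term, which is one way a dataset can be biased against frequency-based methods.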
This is a "red thread" project. It started out as a by-product of running quality checks on our Arabic and Bengali corpora. It continues to draw extensively on the results of our other projects in order to extend the list of measures and experiments across a wide range of collections and techniques. Current experiments focus on term dependency measures and on discovering collection biases that affect the performance of LSA.