My primary research interest is in information extraction, particularly information extraction from the legacy scientific literature. My research focus has been, and remains, biodiversity informatics, so I have been investigating the challenges of information extraction from the old biodiversity literature.
In the past five years I have won four grants in this research area, worth approximately £590,000 in total. Two of these grants were funded by JISC and the other two by the EU, so I am experienced in working on both nationally and internationally funded research projects.
The research challenges that are being investigated by these projects include:
Big data - our source data is the biodiversity literature. The legacy print literature in this domain has been estimated to run to 300 million pages.
Noisy data - Optical Character Recognition (OCR) errors introduced during the scanning process mean that up to two thirds of named entities (e.g. scientific names) are spelt incorrectly; simple spell checking or lookup against an authority is not sufficient to address this problem. For example, 'Homo', the genus name for humans, can be misread by an OCR engine as the butterfly genus 'Homa', so the context of use is very important.
Disambiguation - taxonomic nomenclature requires names to be unique only within Kingdoms, hence there is a bacterial genus 'Bacillus' and an insect genus 'Bacillus'.
Domain specific terminology - the domain makes extensive use of terse language, abbreviations and special characters such as male ♂ and female ♀, and mixes Latin formal descriptions with vernacular text.
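The noisy data challenge above can be illustrated with a minimal Python sketch. It is not the method used in these projects; the authority list, the `lookup` function and the similarity cutoff are all hypothetical, chosen only to show why a lookup against an authority cannot, on its own, recover the intended name.

```python
from difflib import get_close_matches

# Hypothetical, tiny authority list of genus names (illustrative only).
AUTHORITY = ["Homo", "Homa", "Bacillus", "Apis"]

def lookup(token, authority=AUTHORITY, cutoff=0.7):
    """Return candidate names for an OCR token.

    An exact hit is accepted as-is; otherwise close string matches
    are returned. The cutoff value is an assumption for this sketch.
    """
    if token in authority:
        return [token]
    return get_close_matches(token, authority, n=3, cutoff=cutoff)

# An OCR misreading of 'Homo' as 'Homa' is itself a valid name,
# so an exact lookup accepts it without flagging any error.
print(lookup("Homa"))

# A token damaged in another way ('Hom0') is equally close to both
# genera, so string similarity alone cannot choose between them.
print(lookup("Hom0"))
```

In both cases the surrounding text (for example, whether the passage is about primates or butterflies) is needed to settle on the intended name.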
The four grants that I have won recently are:
A Community-driven Curation Process for Taxonomic Databases. JISC Digital Infrastructure Programme: Managing Research Data call, for £85,902.
A data infrastructure to support agricultural scientific communities. Promoting data sharing and development of trust in agricultural sciences. EU Seventh Framework Programme, Capacities – Research Infrastructures. Principal investigator at the OU. The total budget is €4 million, with the OU share being £222,745.
Virtual Biodiversity Research and Access Network for Taxonomy. EU Seventh Framework Programme, Capacities – Research Infrastructures. Principal investigator at the OU and Workpackage leader. The total budget is €4.75 million, with the OU share being £207,685.
Automatic Biodiversity Literature Enhancement. JISC Digitisation Programme: Enhancing Digital Resources call, for £73,261.