Systematics Section / ASPT
Heidorn, P. Bryan; Yin, Qin Wei; Beaman, Reed S.; Cellinese, Nico.
Learning by Example: Machine Learning and Herbarium Label Digitization.
Supervised machine learning (SML), or learning by example, can be used to convert herbarium specimen label data to digital form. In the HERBIS project, the objective of SML is to build a computer system that recognizes patterns in the optical character recognition (OCR) output of scanned herbarium labels and converts them into 36 XML components, including family, genus, species, author, variety, location, collection date, and annotations, for convenient ingest into museum databases. To accomplish this, a human trainer gives the computer correctly classified examples to learn from, and the computer generalizes from these examples to extract information from previously unseen labels. A computer does this very well and, like a savant child, never forgets an example it has seen, but it cannot recognize a pattern it has never seen before. For example, the determiner on a label might be indicated by “Determiner:”, “DET” or “Det.”, all of which are different strings from the point of view of the computer. It is therefore the job of the human trainer to provide examples that are representative of the tasks the computer will be asked to perform, including under conditions of OCR error. The trainer must tell the computer how to classify strings like “DFT:”, where the OCR misread a faded “E” as an “F”, as well as many other frequent but systematic errors. Using a combination of Rote Patterns Learning, Naïve Bayes classification, Hidden Markov Models, and other techniques, HERBIS reaches high accuracy on some elements but not all. Performance is being improved through better algorithms and better training examples. With a little practice, botanists can learn to provide training examples that allow the HERBIS SML system to efficiently convert herbarium label data to database format.
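The role of training examples can be illustrated with a small sketch. It is not the HERBIS implementation: the class name NaiveBayesLabeler, the helper char_bigrams, the two labels, and the eight training strings are all assumptions made for illustration. The sketch contrasts an exact-match lookup, which knows only strings it has been shown, with a Naïve Bayes classifier over character bigrams, which can label a nearby variant once the trainer has supplied representative examples, including an OCR-damaged one.

```python
import math
from collections import Counter, defaultdict


def char_bigrams(token):
    """Split a token into overlapping, lowercased character bigrams.

    "^" and "$" mark the start and end of the token (an illustrative choice).
    """
    padded = "^" + token.lower() + "$"
    return [padded[i:i + 2] for i in range(len(padded) - 1)]


class NaiveBayesLabeler:
    """Multinomial Naive Bayes over character bigrams, with add-one smoothing."""

    def __init__(self):
        self.class_counts = Counter()
        self.feature_counts = defaultdict(Counter)
        self.vocabulary = set()

    def train(self, examples):
        for token, label in examples:
            self.class_counts[label] += 1
            for feat in char_bigrams(token):
                self.feature_counts[label][feat] += 1
                self.vocabulary.add(feat)

    def predict(self, token):
        total = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, count in self.class_counts.items():
            # Log prior plus smoothed log likelihood of each bigram.
            score = math.log(count / total)
            denom = sum(self.feature_counts[label].values()) + len(self.vocabulary)
            for feat in char_bigrams(token):
                score += math.log((self.feature_counts[label][feat] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Hypothetical training examples supplied by a human trainer.
TRAINING = [
    ("Determiner:", "determiner"),
    ("DET", "determiner"),
    ("Det.", "determiner"),
    ("DFT:", "determiner"),  # OCR misread of a faded "E" as "F"
    ("Collector:", "collector"),
    ("COLL", "collector"),
    ("Coll.", "collector"),
    ("Leg.", "collector"),
]

# Exact-match lookup: knows only the exact strings it was given.
ROTE = dict(TRAINING)

labeler = NaiveBayesLabeler()
labeler.train(TRAINING)
```

With this toy training set, ROTE.get("DET:") returns None because that exact string was never supplied, while labeler.predict("DET:") returns "determiner" because its bigrams overlap with the determiner examples. The classifier is only as good as the examples behind it, which is the point made above about choosing representative training data.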
HERBIS Home Page
HERBIS Development Page
HERBIS Parsing Demo
1 - University of Illinois at Urbana-Champaign, Graduate School of Library and Information Science, 501 East Daniel St. MC-493, Champaign, Illinois, 61820-6212, USA
2 - University of Illinois, Graduate School of Library and Information Science, 501 East Daniel Street, Champaign, IL, 61820, United States
3 - Yale University, Peabody Museum of Natural History, Botany Division, PO Box 208118, New Haven, Connecticut, 06520-8118, USA
Presentation Type: Oral Paper: Papers for Sections
Location: Continental C/Hilton
Date: Tuesday, July 10th, 2007
Time: 11:00 AM