UPENN Biomedical Graduate Studies

Fernando C. Pereira

Contact information

305 Levine Hall
3330 Walnut Street
Philadelphia, PA 19104

Office: 215-573-5041
Fax: 215-898-0587

Email:
pereira@central.cis.upenn.edu

Publications

Search PubMed for articles

Links
Department of Computer and Information Science faculty webpage.

Education:

University of Edinburgh, 1982.

Description of Research Expertise

Research
My main research goal is to develop machine-learnable models of language and other natural sequential data such as biological sequences. Penn, with its strong and growing machine learning group, is the ideal place to pursue that goal. My most recent work has been on finite-state models for text information extraction and speech recognition, but I am also interested in information-theoretic approaches to inducing compact representations of multivariate data, and in bridging the gap between distributional and logical views of natural-language syntax and semantics.

Conditional probability models for information extraction and segmentation
Many sequence-processing problems involve breaking a sequence into subsequences (person name vs. other) or labeling sequence elements (parts of speech). Existing probabilistic methods for these tasks, in particular HMMs, have difficulty dealing with multiple overlapping features of the input sequence. Maximum entropy Markov models were a first approach to this problem, but they have now been superseded by the more powerful conditional random fields. We are applying these models to text information extraction and gene finding.
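
As a rough illustration of what "multiple overlapping features" means in a conditional model, the sketch below scores every labeling of a three-word input and normalizes over all labelings, in the style of a linear-chain conditional random field. The label set, feature names, and hand-set weights are hypothetical and chosen only for the example. A real model would learn the weights from data and use dynamic programming rather than brute-force enumeration.

```python
import itertools
import math

LABELS = ("NAME", "OTHER")
TITLES = {"Dr.", "Mr.", "Ms."}

# Hypothetical hand-set weights, for illustration only (a real model learns them).
WEIGHTS = {
    ("capitalized", "NAME"): 2.0,
    ("title", "OTHER"): 3.0,
    ("after_title", "NAME"): 1.5,
    ("lowercase", "OTHER"): 1.0,
    ("prev=NAME", "NAME"): 0.5,  # transition feature on adjacent labels
}

def features(words, i, prev_label):
    """Observations about position i; several may fire at the same time."""
    feats = ["capitalized" if words[i][0].isupper() else "lowercase"]
    if words[i] in TITLES:
        feats.append("title")
    if i > 0 and words[i - 1] in TITLES:
        feats.append("after_title")
    if prev_label is not None:
        feats.append("prev=" + prev_label)
    return feats

def score(words, labels):
    """Sum of the weights of all (feature, label) pairs that fire."""
    total, prev = 0.0, None
    for i, label in enumerate(labels):
        for f in features(words, i, prev):
            total += WEIGHTS.get((f, label), 0.0)
        prev = label
    return total

def conditional_distribution(words):
    """p(labels | words) by enumerating every labeling (tiny inputs only)."""
    scores = {ys: score(words, ys)
              for ys in itertools.product(LABELS, repeat=len(words))}
    z = sum(math.exp(s) for s in scores.values())  # normalizer over labelings
    return {ys: math.exp(s) / z for ys, s in scores.items()}

dist = conditional_distribution(["Dr.", "Pereira", "spoke"])
best = max(dist, key=dist.get)
print(best, round(dist[best], 3))
```

Here "Dr." fires both the capitalization feature and the title feature. The model combines them freely because it only conditions on the input and never has to model how the input was generated.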

Finite-state speech processing
What do regular expressions turn into when we need to assign weights (maybe probabilities) to alternative matches, and to compose pattern matchers? Weighted finite-state transducers. At AT&T, I was involved in developing these as a framework for speech recognition, leading to the creation of a powerful library that has been made available for non-commercial use.
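
The idea of composing weighted pattern matchers can be sketched in a few lines. The code below is a minimal, epsilon-free composition of two toy transducers whose weights are costs (added along a path, minimized over alternatives). The dictionary representation and the function names `compose` and `best_cost` are assumptions made for this example and are not the interface of the library mentioned above.

```python
from collections import defaultdict

# A transducer is a dict with "start", "finals" (state -> final cost) and
# "arcs" (state -> list of (input, output, cost, next_state)).

def compose(a, b):
    """Epsilon-free composition: a's output labels must match b's input labels."""
    start = (a["start"], b["start"])
    arcs, finals = defaultdict(list), {}
    stack, seen = [start], {start}
    while stack:
        state = stack.pop()
        qa, qb = state
        if qa in a["finals"] and qb in b["finals"]:
            finals[state] = a["finals"][qa] + b["finals"][qb]
        for (i, mid, w1, na) in a["arcs"].get(qa, []):
            for (mid2, o, w2, nb) in b["arcs"].get(qb, []):
                if mid == mid2:
                    nxt = (na, nb)
                    arcs[state].append((i, o, w1 + w2, nxt))  # costs add
                    if nxt not in seen:
                        seen.add(nxt)
                        stack.append(nxt)
    return {"start": start, "finals": finals, "arcs": dict(arcs)}

def best_cost(t, inputs):
    """Lowest-cost accepting path reading `inputs`: (cost, outputs) or None."""
    frontier = {t["start"]: (0.0, ())}  # state -> best (cost, outputs) so far
    for sym in inputs:
        nxt = {}
        for state, (cost, out) in frontier.items():
            for (i, o, w, n) in t["arcs"].get(state, []):
                if i == sym:
                    cand = (cost + w, out + (o,))
                    if n not in nxt or cand[0] < nxt[n][0]:
                        nxt[n] = cand
        frontier = nxt
    results = [(c + t["finals"][s], out)
               for s, (c, out) in frontier.items() if s in t["finals"]]
    return min(results) if results else None

# Toy example: A rewrites "a" as "x" (cost 1.0) or "y" (cost 0.5);
# B rewrites "x" as "X" (cost 0.25) or "y" as "Y" (cost 2.0).
A = {"start": 0, "finals": {0: 0.0},
     "arcs": {0: [("a", "x", 1.0, 0), ("a", "y", 0.5, 0)]}}
B = {"start": 0, "finals": {0: 0.0},
     "arcs": {0: [("x", "X", 0.25, 0), ("y", "Y", 2.0, 0)]}}
C = compose(A, B)
result = best_cost(C, ["a"])
print(result)
```

Taken alone, A prefers "y", but the composed machine prefers the path through "x" because the combined cost is lower. That is the point of composing the matchers before searching.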

The information bottleneck
How does one quantify the notion of information about something? Given some variables of interest, sources of information about those variables can be compressed while preserving the information about the variables. The tradeoff between compression and information preservation, which we call the information bottleneck, answers the question. Using this model, we can build compact representations of complex relationships, for instance word cooccurrences in text.
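
One common way to write this tradeoff uses symbols that are assumptions of this note and do not appear in the text above. Let X be the source of information, Y the variables of interest, and T the compressed representation of X. The bottleneck then seeks a stochastic mapping from X to T that minimizes

```latex
\min_{p(t \mid x)} \; I(X;T) \;-\; \beta \, I(T;Y)
```

where I(·;·) is mutual information and β ≥ 0 sets how much preserved information about Y is worth relative to compression of X. A small β favors very compact representations, and a large β favors representations that keep more of what X says about Y.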

Formal semantics of natural language
The syntactic structures of natural-language sentences and their meanings must be linked by a systematic, compositional process for language learning and use to be possible. However, this form of compositionality is more subtle than those used in logical and programming languages. Linear logic turns out to be a good metalanguage to describe the natural-language syntax-semantics mapping.

Selected Publications

David McAllester, Michael Collins, and Fernando Pereira: Case-factor diagrams for structured probabilistic modeling. Proceedings of the 20th Conference on Uncertainty in Artificial Intelligence, July 2004.

Ryan McDonald and Fernando Pereira: Identifying gene and protein mentions in text using conditional random fields. Critical Assessment of Text Mining Methods in Molecular Biology. European Molecular Biology Organization, 2004.

Ryan T. McDonald, R. Scott Winters, Mark Mandel, Yang Jin, Peter S. White, and Fernando Pereira: An entity tagger for recognizing acquired genomic variations in cancer literature. Bioinformatics, 2004. To appear.

Fei Sha and Fernando Pereira: Shallow parsing with conditional random fields. Proceedings of HLT-NAACL 2003. Association for Computational Linguistics, pages 213-220, 2003.

Mehryar Mohri, Fernando Pereira, and Michael Riley: Weighted finite-state transducers in speech recognition. Computer Speech and Language 16(1): 69-88, 2002.