Sanda Harabagiu
Professor and Erik Jonsson School Research Initiation Chair
Department of Computer Science
University of Texas at Dallas
Richardson, TX 75083-0688
Office: ECSS-3.411
Phone: (972) 883-4654
Fax: (972) 883-2349

Current Sponsored Research Projects:

Automatic discovery and processing of EEG cohorts from clinical records (Collaboration with Temple University)

Electronic medical records (EMR) collected at every hospital across the country collectively contain a staggering wealth and breadth of biomedical knowledge. This information could be transformative if properly harnessed. However, before archival medical records can be mined, they must first be wrangled. This is a challenging problem because of the multimedia nature of the data (e.g., EEG signals, MRI images) and the complexity of the language used to interpret this data and assign diagnoses or suggest the medical problems based on the conditions observed. Information about patient medical problems, treatments, and clinical course is essential for conducting comparative effectiveness research. Uncovering clinical knowledge that enables comparative research is the primary goal of this NIH-funded Project, part of the NIH Big Data to Knowledge Program.

In this project, we tackle the processing of a large collection of EEG reports. Clinical electroencephalography (EEG) is the most important investigation in the diagnosis and management of epilepsies. In addition, it is used to evaluate other types of brain disorders, including encephalopathies, neurological infections, Creutzfeldt-Jakob disease and other prion disorders, and even the progression of Alzheimer's disease. An EEG records the electrical activity along the scalp and measures spontaneous electrical activity of the brain. The signals measured along the scalp can be correlated with brain activity, which makes EEG a primary tool for the diagnosis of brain-related illnesses.

We have developed methods that enabled us to automatically process the clinical language in all EEG reports documenting 25,000 sessions and 15,000 patients, collected over 12 years at Temple University Hospital. We have developed novel methods for identifying, in the EEG reports, the EEG activities, EEG events and patterns, as well as their attributes. In addition to the EEG-specific medical concepts, we have also identified all medical concepts that describe the clinical picture and therapy of the patients. This enabled us to generate a novel index of the big EEG data. Moreover, because the EEG data is multi-modal, we have created a multi-modal patient cohort retrieval system. Indexing EEG clinical information is complicated by the fact that the index must organize both medical concepts extracted from the EEG reports with natural language processing techniques and EEG signal data originating from the EEG recordings. While the EEG reports are clinical texts organized into sections, the EEG signal recordings use data formats that capture the magnitude of the electrode potentials along EEG channels at many time samples. We created a novel EEG index which captures multi-modal clinical knowledge processed from both the reports and the signal recordings. While medical language processing enabled the indexing of information from the EEG reports, deep learning methods enabled the representation of EEG signal recordings.
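As a rough illustration of what an index over both modalities might look like, the following Python sketch keeps, for each EEG session, the medical concepts extracted from its report together with a fixed-length vector standing in for the learned representation of its signal recording. The class names, field names, and sample values are all hypothetical and are not taken from the project's system.

```python
from __future__ import annotations

from collections import defaultdict
from dataclasses import dataclass


@dataclass
class EEGSession:
    """One indexed session (hypothetical structure)."""
    session_id: str
    concepts: set[str]          # medical concepts extracted from the report text
    signal_vector: list[float]  # stand-in for a learned signal representation


class MultiModalIndex:
    """Toy inverted index: concept -> sessions, with signal vectors attached."""

    def __init__(self) -> None:
        self._sessions: dict[str, EEGSession] = {}
        self._by_concept: defaultdict[str, set[str]] = defaultdict(set)

    def add(self, session: EEGSession) -> None:
        """Store the session and register it under each of its concepts."""
        self._sessions[session.session_id] = session
        for concept in session.concepts:
            self._by_concept[concept].add(session.session_id)

    def lookup(self, concept: str) -> list[EEGSession]:
        """Return sessions whose report mentions the concept, ordered by id."""
        ids = sorted(self._by_concept.get(concept, set()))
        return [self._sessions[i] for i in ids]


# Made-up sample data, for illustration only.
index = MultiModalIndex()
index.add(EEGSession("s1", {"spike", "seizure"}, [0.1, 0.9]))
index.add(EEGSession("s2", {"sharp wave"}, [0.7, 0.2]))
matches = [s.session_id for s in index.lookup("spike")]
```

Each retrieved session carries both its report concepts and its signal vector, so a later step could filter or re-rank on either modality.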

Scalable EEG interpretation using Deep Learning and Schema Descriptors (Collaboration with Temple University)

Identification of epileptiform activities, seizures and the specific EEG patterns that accompany epilepsy syndromes remains an electroencephalographer's most critical task. The epileptiform activities are informative for detecting and predicting seizures. Epileptiform activity is defined in the Chatrian glossary of terms as distinctive waves or complexes, distinguished from background activity and resembling those recorded in a proportion of human subjects suffering from epileptic disorders. These waves or complexes can appear as isolated focal spikes or sharp waves, generalized polyspike, spike and wave or paroxysmal fast activity, and sometimes as abrupt rhythmic evolution of the background that heralds seizures. EEG signals record both epileptiform activities and EEG events. While the Hierarchical Event Descriptors (HED) have defined many types of EEG experimental events, no existing components standardize the epileptiform activities and their attributes. We filled this gap by generating a schema of Hierarchical epileptiform Activity Descriptors (HAD). Similarly to HED, we gave HAD a hierarchical structure, organizing hierarchies for (1) the epileptiform activity waveform; (2) the epileptiform activity frequency band; (3) the epileptiform activity anatomical location; (4) the epileptiform activity position; (5) the epileptiform activity distribution; (6) the epileptiform activity frequency; (7) the epileptiform activity magnitude. We have also developed a novel Deep Active Learning framework used for the automatic annotation of EEG reports with HAD tags.
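The seven attribute hierarchies listed above can be pictured as a simple record type. The Python sketch below is only an illustration: the class name, the slash-delimited tag format, and every attribute value are assumptions made for the example and are not the published HAD schema.

```python
from __future__ import annotations

from dataclasses import dataclass, fields


@dataclass(frozen=True)
class EpileptiformActivity:
    """One annotated activity; fields mirror the seven hierarchies in the text."""
    waveform: str
    frequency_band: str
    anatomical_location: str
    position: str
    distribution: str
    frequency: str
    magnitude: str

    def to_tags(self) -> list[str]:
        """Render each attribute as a hierarchical tag (hypothetical format)."""
        return [f"HAD/{f.name}/{getattr(self, f.name)}" for f in fields(self)]


# Placeholder values: only "sharp_wave" and "focal" echo terms used in the
# text; "unspecified" marks attributes whose vocabularies are not given there.
activity = EpileptiformActivity(
    waveform="sharp_wave",
    frequency_band="unspecified",
    anatomical_location="unspecified",
    position="unspecified",
    distribution="focal",
    frequency="unspecified",
    magnitude="unspecified",
)
tags = activity.to_tags()
```

Keeping one field per hierarchy means an annotation tool can validate each attribute against its own controlled vocabulary once those vocabularies are supplied.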

Past Sponsored Projects:

Natural language processing to extract actionable findings from radiology reports (Collaboration with University of Texas Southwestern Medical Center at Dallas)

This project aims to take historical data from the EMR for a single, relatively well understood disease process (appendicitis), where a radiology exam (CT scan) is known to lead to an improved outcome (decreased incidence of negative appendectomies), and to apply NLP algorithms to text reports (radiology and pathology) in order to identify significant radiographic, pathologic, and physical exam findings. Ultimately, this information can be cross-referenced with laboratory test results to form the basis of an automated decision support tool for physicians ordering radiology studies, radiologists dictating reports, and physicians interpreting radiology reports. The primary goal of this project is to evaluate the utility of NLP algorithms in analyzing radiology text reports. To this end, patients with an admission or discharge (ICD9) diagnosis of appendicitis or suspected appendicitis will be identified. This group will be divided into three cohorts: those who did not have surgery, those who had surgery with pathology-positive appendicitis, and those who had surgery in which no appendicitis was found. Radiology reports will be analyzed for words or phrases that are associated with pathology-proven appendicitis. The null hypothesis is that there is no difference between CT findings historically associated with appendicitis in the medical literature and those identified by computational NLP methods.
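The three-cohort split described above can be written down as a small decision rule. The following Python sketch assumes each patient record exposes two hypothetical fields, whether surgery was performed and whether pathology confirmed appendicitis; the names and the sample records are made up for illustration.

```python
from __future__ import annotations

from collections import Counter
from enum import Enum
from typing import Optional


class Cohort(Enum):
    NO_SURGERY = "no surgery"
    SURGERY_POSITIVE = "surgery, pathology-positive appendicitis"
    SURGERY_NEGATIVE = "surgery, no appendicitis found"


def assign_cohort(had_surgery: bool, pathology_positive: Optional[bool]) -> Cohort:
    """Place one patient into one of the three cohorts."""
    if not had_surgery:
        return Cohort.NO_SURGERY
    if pathology_positive is None:
        # A surgical patient without a pathology result cannot be classified.
        raise ValueError("surgical patients need a pathology result")
    return Cohort.SURGERY_POSITIVE if pathology_positive else Cohort.SURGERY_NEGATIVE


# Made-up records: (had_surgery, pathology_positive)
patients = [(False, None), (True, True), (True, False), (True, True)]
counts = Counter(assign_cohort(surgery, pathology) for surgery, pathology in patients)
```

The third cohort corresponds to negative appendectomies, the outcome the CT exam is meant to reduce.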

Cohort Shepherd: Discovering Cohort Traits from Hospital Visits

This project is concerned with content-based retrieval of electronic medical records (EMRs), and it produced a system that participated in the TREC EMR-retrieval track. Participants in this evaluation were given a set of EMRs from the University of Pittsburgh BLU-Lab NLP Repository as well as a mapping between hospital visits and medical records. Additionally, we were provided with a set of sample topics. These topics are based on a list of priority areas created by the Institute of Medicine (CCERP and Institute of Medicine). Each topic targets certain cohorts (i.e., groups of people sharing a common attribute) and is designed to find a population over which comparative effectiveness studies can be done. The goal of this retrieval system is to return a ranked list of hospital visits that satisfy the requirements expressed in each topic. A hospital visit is a set of electronic medical records that pertain to a single patient's visit to the hospital. As each hospital visit contains multiple EMRs (as many as 415), producing a ranked list of hospital visits is much more complicated than retrieving a ranked list of individual documents, particularly when the query is as complex as a topic. Moreover, hospital visits may consist of multiple types of EMRs, e.g., an operating room report, multiple radiology reports, a discharge summary, and other reports detailing physical findings, plans of treatment, descriptions of the patient's problem, or laboratory test results. In a hospital visit, many EMRs are generated for a patient, but only a few of them may be relevant to the topic of interest. Because of this, the content-based retrieval system that we built for this evaluation operates at the hospital visit level instead of the EMR level.
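To make the difference between report-level and visit-level retrieval concrete, the Python sketch below pools the text of all reports belonging to a visit before scoring it against a topic. The term-overlap score is a deliberately simple stand-in, not the ranking method used by the system, and the function name, identifiers, and sample texts are hypothetical.

```python
from __future__ import annotations

from collections import defaultdict


def rank_visits(
    reports: dict[str, str],
    report_to_visit: dict[str, str],
    query: str,
) -> list[str]:
    """Rank hospital visits by how many query terms their pooled reports cover."""
    query_terms = set(query.lower().split())

    # Pool the tokens of every report into its parent visit.
    visit_tokens: defaultdict[str, set[str]] = defaultdict(set)
    for report_id, text in reports.items():
        visit_tokens[report_to_visit[report_id]].update(text.lower().split())

    # Score each visit as a whole, then sort by score (ties broken by visit id).
    scores = {visit: len(tokens & query_terms) for visit, tokens in visit_tokens.items()}
    return sorted(scores, key=lambda visit: (-scores[visit], visit))


# Made-up sample data.
reports = {
    "r1": "discharge summary patient treated for appendicitis",
    "r2": "radiology report ct abdomen shows inflamed appendix",
    "r3": "operative report hip replacement",
}
report_to_visit = {"r1": "v1", "r2": "v1", "r3": "v2"}
ranked = rank_visits(reports, report_to_visit, "appendicitis ct")
```

In this toy data, neither report of visit `v1` matches both query terms on its own, but the pooled visit does, which is the situation that motivates scoring at the visit level.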

i2b2 - Informatics for Integrating Biology and the Bedside

The i2b2 (Informatics for Integrating Biology and the Bedside) project was inspired by the i2b2 NIH-funded initiative led by the National Center for Biomedical Computing based at Partners HealthCare System. The i2b2 project contributes to a scalable informatics framework that will enable clinical researchers to use existing clinical data for discovery research and facilitate the design of targeted therapies for individual patients with diseases having genetic origins. One of the goals of this project is to participate yearly in the i2b2 shared challenges and their affiliated tasks. We first participated in 2010. Our systems obtained excellent scores for their ability to detect medical concepts and their inter-relations, as well as assertions, in clinical notes. We participated again in 2011, obtaining excellent results in the recognition of coreference in clinical texts.

Advanced Pragmatics for Natural Language Processing

The research objectives of this project capture the complex pragmatic phenomena encountered in textual discourse. Inference for several difficult problems is considered: (1) event coreference and induction of event structures from massive text collections; (2) discourse parsing, with a special focus on recognition of elaborations; (3) causal inference; (4) temporal inference; and (5) spatial inference. To evaluate our techniques, we participate in the SemEval workshops, where we have obtained some of the best scores since 2004 in a variety of semantic processing tasks.

AQUAINT AQUINAS Project (Collaboration with Stanford University and ICSI Berkeley)

AQUINAS is an abbreviation for Answering QUestions using INference and Advanced Semantics. The research involves innovations in (1) language analysis, (2) question processing based on complex semantics, (3) indexing using semantic information, (4) extraction and inference of answers, (5) use of corpora and knowledge bases for Question Answering, and (6) learning techniques for abductive reasoning.

AQUAINT Computational Implicatures for Advanced Question Answering (Collaboration with ICSI Berkeley)

The capability of interpreting question implicatures is a very important feature of advanced Question Answering systems. When using a Question Answering system to find information, a professional analyst cannot separate his/her intentions and beliefs from the formulation of the question, and therefore (s)he incorporates intentions and beliefs in the interrogation. Moreover, beyond the question, the analyst sometimes makes a proposal or an assertion. This implied information, not recognizable at the syntactic or semantic level, has great importance in the interpretation of a question, and therefore in the quality of the answers returned by a Question Answering system. This project is concerned with the study and development of computational methods that enable coercions of implicatures in the context of advanced Question Answering.

NSF CAREER: Reference Resolution for Natural Language Understanding

A major obstacle in building robust systems that extract and interpret information, and summarize and answer questions from texts, is the need to identify the entities referred to by pronouns or other referential expressions. This project extended work in empirical reference resolution with a learning framework for globally optimal decisions. Extensions involved the support of semantic consistency between coreferring expressions and bootstrapping based on large sets of training data.

ARP: Knowledge Mining for Open-Domain Information Extraction

This project created the framework for using complex semantic information in efficient information extraction. By developing state-of-the-art semantic parsers trained on PropBank and FrameNet, knowledge can be mined without extensive domain customization.

NSF CADRE: A Tool for Transforming WordNet into a Core Knowledge Base

This project extended a popular database of English words to make it more useful in such tasks as question answering, information retrieval, and summarization. WordNet is a lexical database for English that has been widely adopted in artificial intelligence and computational linguistics for a variety of practical applications. The basic elements of WordNet are sets of words that are linked according to semantic relations: synonymy, antonymy, super-ordination, and so forth. WordNet is publicly available, widely used, and is currently being transformed into a multi-lingual database. The focus of the project was to enable the usage of WordNet for knowledge-intensive applications by processing the concept definitions known as glosses. The glosses were part-of-speech tagged, syntactically parsed and semantically disambiguated. The extended WordNet is available for download.