McCourt School of Public Policy
Massive Data Institute

Data Blending

More and more unstructured, organic data related to human behavior, beliefs, and opinions are being shared online. Because of their availability and richness, these data are an important source of information for social scientists attempting to characterize and predict human and societal dynamics. They give insights that traditional survey data can miss and are less costly to collect. To broaden the reach of text analytic methods and tools across social, behavioral, and economic research, we are building a community of social scientists from multiple disciplines who will work on data blending projects to advance their research.

Data Blending: Tackling the Obstacles

In April 2019, MDI hosted a panel discussion and conversation about Data Blending in the Bioethics Research Library, featuring scholars from multiple institutions. The panel discussion and the conversation, moderated by Provost Robert Groves, provided a unique look into both the promises and challenges of combining traditional and new forms of data, especially organic data, and resulted in a white paper outlining some of these findings. For more information on the event and to read the white paper, please visit the event page.


The MDI Data Blending Portal

Given the range of data maintained at the MDI, we have developed a portal that integrates data from different text data sources to create variables that social scientists can use within their traditional research portfolios. This allows researchers to blend knowledge obtained from unstructured text data, including social media data, with more well-structured variables. The portal gives researchers the flexibility to generate variables at different time scales (daily, monthly, annually), for different subsets of data, by using different data matching, data mining and machine learning algorithms. The portal is being used by researchers analyzing the 2016 U.S. Presidential Election and researchers investigating movement patterns in Iraq and Syria.
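
As a rough illustration of what generating a text-derived variable at different time scales can look like, the sketch below counts keyword mentions in a handful of made-up posts and groups them by day, month, or year. It is not the portal's implementation: the sample records, the `keyword_counts` function, and the `SCALE_FORMATS` mapping are hypothetical names introduced here, and simple substring matching stands in for the data matching, data mining, and machine learning algorithms the portal offers.

```python
from collections import Counter
from datetime import date

# Hypothetical sample records (publication date, text); not real MDI data.
posts = [
    (date(2016, 11, 7), "Election polls tighten before the vote"),
    (date(2016, 11, 8), "Election day turnout is high"),
    (date(2016, 11, 8), "Weather is mild across the region"),
    (date(2016, 12, 1), "Analysts review the election results"),
]

# strftime patterns that define the grouping key for each time scale.
SCALE_FORMATS = {"daily": "%Y-%m-%d", "monthly": "%Y-%m", "annually": "%Y"}


def keyword_counts(records, keyword, scale):
    """Count records whose text mentions `keyword`, grouped by time scale."""
    fmt = SCALE_FORMATS[scale]
    counts = Counter()
    for day, text in records:
        # Case-insensitive substring match; a stand-in for real text mining.
        if keyword.lower() in text.lower():
            counts[day.strftime(fmt)] += 1
    return dict(counts)


daily = keyword_counts(posts, "election", "daily")
monthly = keyword_counts(posts, "election", "monthly")
annual = keyword_counts(posts, "election", "annually")
```

Each resulting dictionary maps a period label to a count, so it can be joined on date to a more structured variable, such as a survey measure recorded at the same time scale.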


The Expandable Open-Source Database (EOS)

The EOS database consists of more than 700 million publicly available, open-source media articles and blog posts actively compiled since 2006. Currently, data are collected from over 20,000 Internet sources at a rate of approximately 100,000 articles per day. EOS includes a web-based search engine that 1) searches the archive by keyword or by using geospatial maps, 2) saves searches, 3) publishes searches, and 4) exports articles in bulk for more detailed analysis. Researchers in a number of disciplines use EOS to collect raw data, frequency counts, topic buzz levels, and other variables to support their research goals. Project domains currently using the archive include forced migration, human trafficking, infectious disease, and religious conflict.

Contact: Lisa Singh, Professor of Computer Science & Research Professor (MDI)