McCourt School of Public Policy
Massive Data Institute

Working Research and Collaborations

Below is a list of published and forthcoming research by our faculty and affiliates.

Singh, L., Wahedi, L., Wang, Y., Kirov, C., Wei, Y., Martin, S., Donato, K., Liu, Y., and Kawintiranon, K. (to appear). Blending Noisy Social Media Signals with Traditional Movement Variables to Predict Forced Migration. In ACM International Conference on Knowledge Discovery and Data Mining (KDD), Anchorage, Alaska, August 4-8, 2019.
Abstract: Worldwide displacement due to war and conflict is at an all-time high. Unfortunately, determining if, when, and where people will move is a complex problem. This paper proposes integrating both publicly available organic data from social media and newspapers with more traditional indicators of forced migration to determine when and where people will move. We combine movement and organic variables with spatial and temporal variation within different Bayesian models and show the viability of our method using a case study involving displacement in Iraq. Our analysis shows that incorporating open-source generated conversation and event variables maintains or improves predictive accuracy over traditional variables alone. This work is an important step toward understanding how to leverage organic big data for societal-scale problems.

Mishori, R., Singh, L., and Wei, Y. (2019). #Diversity: Conversations on Twitter about Women and Black Men in Medicine. In Journal of the American Board of Family Medicine, 32(1): 28-36.
Abstract: Discussions about racism, ethnicity, sexism, discrimination, and diversity, and about their impact on the physician workforce, advancement, hiring, wage inequities, mistreatment, and scholarly output, to name a few, have increased within medicine. Most medical organizations have created policies and initiatives on diversity and inclusion, focusing on supporting underrepresented minorities. Similar discussions are taking place online, including on Twitter, via specific hashtags such as #BlackMenInMedicine and #ILookLikeASurgeon. News reports suggested some of these hashtags were “trending.” We set out to assess selected hashtags and analyze their spread, as well as whether or how health professional organizations publicized or amplified this emerging discourse on Twitter. Methodology: We computed tweet volume, retweet volume, impressions, and spread for selected hashtags and for health-profession organizations.

Staudt, J., Wei, Y., Singh, L., Klimek, S. D., Jensen, J. B., and Baer, A. L. (2019). Automating Response Evaluation for Franchising Questions on the 2017 Economic Census. In National Bureau of Economic Research (NBER), Working Paper No. 25818.
Paper accepted for publication at the National Bureau of Economic Research. Abstract: Between the 2007 and 2012 Economic Censuses (EC), the count of franchise-affiliated establishments declined by 9.8%. One reason for this decline was a reduction in resources that the Census Bureau was able to dedicate to the manual evaluation of survey responses in the franchise section of the EC. Extensive manual evaluation in 2007 resulted in many establishments, whose survey forms indicated they were not franchise-affiliated, being recoded as franchise-affiliated. No such evaluation could be undertaken in 2012. In this paper, we examine the potential of using external data harvested from the web in combination with machine learning methods to automate the process of evaluating responses to the franchise section of the 2017 EC. Our method allows us to quickly and accurately identify and recode establishments that have been mistakenly classified as not being franchise-affiliated, increasing the unweighted number of franchise-affiliated establishments in the 2017 EC by 22-42 percent.

McCabe, B. and Heerwig, J. (2018). Expanding Participation in Municipal Campaigns: Evaluating the Impact of Seattle’s Democracy Voucher Program. In Center for Studies in Demography & Ecology (CSDE).
Executive Summary: In 2015, voters in Seattle approved the Democracy Voucher program to radically reshape the way municipal elections are funded. By providing vouchers to every registered voter in the city, the program aimed to broaden the donor pool and diversify contributors in local elections. Seattle is the first city in the United States to implement this type of public financing program. Launched in the 2017 election, the Democracy Voucher initiative successfully increased the number of residents participating in the campaign finance system. In total, 20,727 residents in Seattle returned their vouchers – more than twice the number that made a cash contribution to a local political candidate. About four percent of Seattle residents participated in the program. While the Democracy Voucher initiative increased participation in the campaign finance system, some groups of Seattle residents were more likely to return their vouchers than others. Wealthy, white and older residents were more likely to participate in the program than low-income, younger and non-white residents. Individuals who were already politically engaged, as measured by previous voting behavior, were more likely to return their vouchers than registered voters who rarely voted in elections. These differential rates of return by race, income, age and political engagement create opportunities for program improvements in 2019. The Democracy Voucher program is beginning to move the contributor pool in a more egalitarian, representative direction. Compared to cash contributors in the 2017 election, participants in the Democracy Voucher program were generally more representative of the Seattle electorate. Low- and moderate-income residents comprise a substantially larger share of voucher users than cash donors. Voucher users are more likely than cash donors to come from the poorest neighborhoods in the city. Residents under 30 years old make up a larger share of voucher users than cash donors.

Churchill, R., Singh, L., and Kirov, C. (2018). A Temporal Topic Model for Noisy Mediums. In Advances in Knowledge Discovery and Data Mining, Pacific-Asia Conference (PAKDD), Melbourne, Australia, June 3-6, 2018.

Thaler, J., Wahby, R. S., Tzialla, I., Shelat, A., and Walfish, M. (2018). Doubly-efficient zkSNARKs Without Trusted Setup. In 2018 IEEE Symposium on Security and Privacy (SP), San Francisco, CA, May 21-23, 2018.
Abstract: We present a zero-knowledge argument for NP with low communication complexity, low concrete cost for both the prover and the verifier, and no trusted setup, based on standard cryptographic assumptions. Communication is proportional to d · log G (for d the depth and G the width of the verifying circuit) plus the square root of the witness size. When applied to batched or data-parallel statements, the prover’s runtime is linear and the verifier’s is sub-linear in the verifying circuit size, both with good constants. In addition, witness-related communication can be reduced, at the cost of increased verifier runtime, by leveraging a new commitment scheme for multilinear polynomials, which may be of independent interest. These properties represent a new point in the tradeoffs among setup, complexity assumptions, proof size, and computational cost. We apply the Fiat-Shamir heuristic to this argument to produce a zero-knowledge succinct non-interactive argument of knowledge (zkSNARK) in the random oracle model, based on the discrete log assumption, which we call Hyrax. We implement Hyrax and evaluate it against five state-of-the-art baseline systems. Our evaluation shows that, even for modest problem sizes, Hyrax gives smaller proofs than all but the most computationally costly baseline, and that its prover and verifier are each faster than three of the five baselines.

Martin, S. and Singh, L. (2018). Data Analytics and Displacement: Using Big Data to Forecast Mass Movement of People. In Maitland, C., editor, Digital Lifeline?: ICTs for Refugees and Displaced Persons. MIT Press.

Wei, Y., Singh, L., Buttler, D., and Gallagher, B. (2018). Using Semantic Graphs to Detect Overlapping Target Events and Story Lines from Newspaper Articles. In International Journal of Data Science and Analytics, 5(1): 41-60.