Written by Carrie McDonald, MDI Journalism Intern
Led by MDI Postdoctoral Fellow Dr. Le Bao, the Massive Data Institute (MDI) held the first installment of its Fall 2023 Data Science Workshop Series on September 23-24 to kickstart this semester’s focus on examining methods for using text as data.
Each semester, MDI sponsors a series of technical workshops for Georgetown faculty, staff, and students to introduce them to state-of-the-art methods, programming paradigms, and technologies. To welcome participants with varied backgrounds in programming and statistical methods, the workshop series aims to balance the basics with more technical material.
“It’s this diversity of experience that’s valuable to the participants and to us,” Bao said. “It brings a mix of perspectives to the table, sparking some lively discussions and exchanges of ideas.”
This first workshop, entitled “Measurement and Inference Issues with Text Data,” provided an introductory overview of different applications of text as data, laying the foundation for more advanced workshops later this semester that will focus on specific machine learning and natural language processing (NLP) techniques.
“I aimed to help participants understand the unique challenges that come with analyzing text data and how these relate to the conventional measurement and inference issues we see in statistics,” Bao said.
Brian Holland, an attendee, MDI Scholar, and a Data Science For Policy student at the McCourt School of Public Policy, said that he enjoyed the “very casual and intimate” atmosphere, in which participants were encouraged to ask questions and interact with Bao and each other.
“The participants were engaged, and there were a lot of thoughtful questions and discussions, which also inspires me to think more about the issues,” Bao said. “I raised an issue from a previous study that suggests that sometimes pronouns have substantive indications and shouldn’t be used as stop words in text analysis, which generated a lot of discussion.”
Using Google Colab, Bao interactively walked participants through several coding exercises, such as plotting the top terms used by candidates in a presidential debate, to enable them to actively use text analysis tools and methods under his guidance.
“This hands-on approach facilitated a clearer understanding of the concepts and equipped attendees with practical skills that could be put into immediate use,” Edward Chen, an attendee and senior in the McDonough School of Business, said.
“The experience of walking through the code as it’s explained was really good,” Holland said. “Sometimes the academic, cited, and constructed reasoning for things can be confusing or unmotivating, but seeing the code to accomplish the tasks sometimes makes it abundantly clear.”
The use of concrete examples throughout the workshop also helped to highlight the many ways in which participants can apply what they learned to their own work.
“The event underscored the practical utility of text analysis within the sphere of public policy by highlighting its real-world applications,” Chen said. “This made it clear that the knowledge and skills gained from the event could have a direct impact on addressing pressing societal challenges.”
The next MDI Data Science Workshop, entitled “Advanced Models Using Text,” will take place on October 23-24 from 4:00-5:30 p.m. Taught by MDI Fellow Dr. Helge Marahrens, the sessions will build upon the introductory material taught by Bao. Marahrens will cover several advanced models for tasks such as identifying the most impactful words, categorizing documents by their thematic content, and predicting emotions in text.