Carnegie Mellon’s Software and Societal Systems Department (S3D) hosts an active research group with a highly interdisciplinary approach to software engineering. Indeed, we believe that interdisciplinary work is inherent to software engineering. The field of software engineering (SE) is built on computer science fundamentals, drawing from areas such as algorithms, programming languages, compilers, and machine learning. At the same time, SE is an engineering discipline: both the practice of SE and SE research problems revolve around technical solutions that successfully resolve conflicting constraints. As such, trade-offs between costs and benefits are an integral part of evaluating the effectiveness of methods and tools.

Emerging problems in privacy, security, and mobility pose many of the challenges faced by today’s software engineers and motivate new solutions in SE research. Because software is built by people, SE is also a human discipline, and so research in the field also draws on psychology and other social sciences. Carnegie Mellon faculty bring expertise from all of these disciplines to bear on their research, and we emphasize this interdisciplinary approach in our REU Site. Below, you'll find projects we are planning for Summer 2023.

Mentor: Christian Kästner

Description and Significance
Advances in machine learning (ML) have stimulated widespread interest in integrating AI capabilities into various software products and services. As a result, today’s software development teams often include both data scientists and software engineers, who tend to have different roles. An ML pipeline generally has two phases: an exploratory phase and a production phase. Data scientists commonly work in the exploratory phase to train an offline ML model (often in computational notebooks) and then deliver it to software engineers, who work in the production phase to integrate the model into the production codebase. However, data scientists tend to focus on improving ML algorithms for better prediction results, often without thinking enough about the production environment; software engineers therefore sometimes need to redo parts of the exploratory work in order to integrate it into production code successfully. In this project, we want to analyze collaboration between data scientists and software engineers, at technical and social levels, in open source and in industry.

Student Involvement
We want to study how data scientists and software engineers collaborate. To this end, we will identify open source projects that use machine learning for production systems (e.g., Ubuntu's face recognition login) and study public artifacts, or we will interview participants in production ML projects. This research involves interviews and analysis of software artifacts. We may also develop exploratory tools to define and document expectations and tests at the interface between different roles in a project. The project can be tailored to the students’ interests, but interest or a background in empirical methods would be useful. Familiarity with machine learning is a plus but not required. Note that this is not a data science/AI project, but a project on understanding *software engineering* practices relevant to data scientists.
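
For a flavor of what "defining and documenting expectations at the interface" could look like, here is a minimal sketch; the wrapper, the model, and the specific checks are hypothetical placeholders (assuming an sklearn-style predict method), not tooling from this project. The idea is that a thin wrapper can turn a data scientist's implicit training-time assumptions into explicit, testable checks for the production side:

```python
import numpy as np

class SentimentModelWrapper:
    """Hypothetical handoff artifact from data scientist to engineers.

    Documents the expectations the exploratory notebook code silently
    assumed, so the production side can test against them.
    """

    EXPECTED_FEATURES = 300   # embedding dimension used during training
    SCORE_RANGE = (0.0, 1.0)  # the model is documented to output a probability

    def __init__(self, model):
        self.model = model    # trained offline, e.g., in a notebook

    def predict(self, features: np.ndarray) -> float:
        # Make implicit training-time assumptions explicit and checkable.
        if features.shape != (self.EXPECTED_FEATURES,):
            raise ValueError(
                f"expected {self.EXPECTED_FEATURES} features, got {features.shape}")
        if np.isnan(features).any():
            raise ValueError("NaN inputs were dropped during training; not supported")
        score = float(self.model.predict(features.reshape(1, -1))[0])
        # Check the documented output contract as well.
        assert self.SCORE_RANGE[0] <= score <= self.SCORE_RANGE[1]
        return score
```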

Mentor: Steven Wu

Description and Significance
Many modern applications of machine learning (ML) rely on datasets that may contain sensitive personal information, including medical records, browsing history, and geographic locations. To protect the private information of individuals, many ML systems now train their models subject to the constraint of differential privacy (DP), which informally requires that no individual training example have a significant influence on the trained model. After well over a decade of intense theoretical study, DP has been deployed by many organizations, including Microsoft, Google, Apple, and LinkedIn, and more recently by the 2020 US Census. However, the majority of existing practical deployments still focus on rather simple data analysis tasks (e.g., releasing simple counts and histogram statistics). To put DP into practice for more complex machine learning tasks, this project will study new differentially private training methods for deep learning that improve on existing state-of-the-art methods. We will also study how to use DP deep learning techniques to train deep generative models, which can generate privacy-preserving synthetic data—a collection of “fake” data that preserves important statistical properties of the original private data set. This, in turn, will enable privacy-preserving data sharing.
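
For intuition, the de facto baseline for DP deep learning is DP-SGD: clip each example's gradient so that no single record can dominate an update, then add Gaussian noise calibrated to that clipping bound. Below is a minimal NumPy sketch of one training step; the hyperparameters and per-example gradients are hypothetical placeholders, and a real implementation would also track the cumulative privacy loss (ε, δ) across steps:

```python
import numpy as np

def dp_sgd_step(params, per_example_grads, lr=0.1, clip_norm=1.0, noise_mult=1.1):
    """One step of DP-SGD: per-example clipping plus Gaussian noise.

    per_example_grads: array of shape (batch_size, num_params), one
    gradient per training example (placeholder data for illustration).
    """
    batch_size = per_example_grads.shape[0]

    # 1. Clip each example's gradient to L2 norm at most clip_norm,
    #    bounding any single example's influence on the update.
    norms = np.linalg.norm(per_example_grads, axis=1, keepdims=True)
    scale = np.minimum(1.0, clip_norm / np.maximum(norms, 1e-12))
    clipped = per_example_grads * scale

    # 2. Sum the clipped gradients and add Gaussian noise calibrated to
    #    the clipping bound; noise_mult governs the (epsilon, delta) guarantee.
    noise = np.random.normal(0.0, noise_mult * clip_norm, size=params.shape)
    noisy_grad = (clipped.sum(axis=0) + noise) / batch_size

    # 3. Ordinary gradient descent update with the privatized gradient.
    return params - lr * noisy_grad
```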

Mentor: Christian Kästner

Description and Significance
Essentially all software uses open source libraries and benefits immensely from this publicly available infrastructure. However, reusing libraries also comes with risks. Libraries may contain bugs and vulnerabilities and are sometimes abandoned; worse, malicious actors are increasingly attacking software systems by hijacking libraries and injecting malicious code (e.g., see event-stream, SolarWinds, and ua-parser-js). Most projects use many libraries, those libraries have dependencies of their own, and we also depend on all kinds of infrastructure, such as compilers and test frameworks, all of which could be attacked. Detected software supply chain attacks increased by 650% in 2021, after a 430% increase in 2020. It has gotten to the point that the government has stepped in and requires software companies to build a “Software Bill of Materials (SBoM)” as a first step toward identifying which libraries are actually used.
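
As a toy illustration of what an SBoM starts from, the sketch below enumerates every package an npm project actually installs by reading its lockfile (assuming the v2/v3 package-lock.json format; a real SBoM would also record hashes, licenses, and provenance for every entry):

```python
import json

def list_dependencies(lockfile_path="package-lock.json"):
    """Enumerate installed packages from an npm lockfile (v2/v3 format)."""
    with open(lockfile_path) as f:
        lock = json.load(f)

    deps = {}
    # Lockfile v2/v3 lists every installed package under "packages",
    # keyed by its path inside node_modules (nested paths included).
    for path, info in lock.get("packages", {}).items():
        if not path:  # the empty key is the root project itself
            continue
        name = path.split("node_modules/")[-1]
        deps[name] = info.get("version", "?")
    return deps

if __name__ == "__main__":
    for name, version in sorted(list_dependencies().items()):
        print(f"{name}@{version}")
```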

So how can we trust this *software supply chain*, even though we have no contractual relations with the developers of all those libraries? Research might involve studying how developers build trust, when trust is justified, which attacks can be automatically detected and mitigated (e.g., with sandboxing and reproducible builds), and what actual attacks in the real world look like. There is a wide range of possible research directions, from code analysis to empirical studies of developers and their relationships, each of which can help secure open source supply chains.

Student Involvement
Depending on student interest, we will investigate different ideas around software supply chains. For example, we could study how the concept of “trust” translates from organizational science to software security in an open source context and how open source maintainers make decisions about security risks (literature analysis, theory building, interviews/surveys); see [1] on trust in a different context. We could build tools that automatically sandbox JavaScript dependencies and evaluate the overhead of doing so; see [2] for related prior work. We could study packages removed from npm to identify what typical supply chain attacks look like in practice. The ideal student for this project is interested in open source and software security.

References
[1] Jacovi, Alon, Ana Marasović, Tim Miller, and Yoav Goldberg. "Formalizing trust in artificial intelligence: Prerequisites, causes and goals of human trust in AI." Proc. FAccT (2021).

[2] Gabriel Ferreira, Limin Jia, Joshua Sunshine, and Christian Kästner. Containing Malicious Package Updates in npm with a Lightweight Permission System. In Proc. International Conference on Software Engineering (ICSE), pages 1334–1346, 2021.

Mentors: Bogdan Vasilescu and Christian Kästner

Description and Significance
Reuse of open source artifacts in software ecosystems has enabled significant advances in development efficiency, as developers can now build on substantial infrastructure and develop apps or server applications in days rather than months or years. However, despite its importance, maintenance of this open source infrastructure is often left to a few volunteers with little funding or recognition, threatening the sustainability of individual artifacts, such as OpenSSL, and of entire software ecosystems. Reports of stress and burnout among open source developers are increasing. The teams of Dr. Kästner and Dr. Vasilescu have explored dynamics in software ecosystems to expose differences, understand practices, and plan interventions [1,2,3,4]. Results indicate that different ecosystems have very different practices and that interventions should be planned accordingly [1], but also that signaling based on underlying analyses can be a strong means to guide developer attention and effect change [2]. This research will further explore sustainability challenges in open source, with particular attention to the interaction between paid and volunteer contributors, stress, and resulting turnover.

Student Involvement
Students will empirically study sustainability problems and interventions, using interviews, surveys, and statistical analysis of archival data (e.g., regression modeling, time series analysis for causal inference). What are the main reasons for volunteer contributors to drop out of open source projects? In what situations do volunteer contributors experience stress? In which projects will other contributors step up and continue maintenance when the main contributors leave? Which past interventions, such as contribution guidelines and codes of conduct, have been successful in retaining contributors and easing transitions? How can we identify subcommunities within software ecosystems that share common practices, and how do communities and subcommunities learn from each other? Students will investigate these questions by exploring archival data of open source development traces (ghtorrent.org), designing interviews or surveys, applying statistical modeling techniques, building and testing theories, and conducting literature surveys. Students will learn state-of-the-art research methods in empirical software engineering and apply them to specific sustainability challenges of great importance. Students will actively engage with open source communities and will learn to communicate their results to both academic and nonacademic audiences.
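
To give a flavor of the statistical side of this work, here is a minimal sketch of a regression analysis on contributor dropout. The data here is randomly simulated for illustration (all variable names are hypothetical); in practice these variables would be mined from archival development traces such as ghtorrent.org:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n = 500

# Hypothetical archival data: one row per contributor.
contributors = pd.DataFrame({
    "is_paid": rng.integers(0, 2, n),         # paid (1) vs. volunteer (0)
    "tenure_months": rng.integers(1, 48, n),  # time active in the project
})

# Simulated outcome: volunteers and newcomers drop out more often.
logit_p = 0.5 - 1.0 * contributors.is_paid - 0.05 * contributors.tenure_months
contributors["dropped_out"] = (
    rng.random(n) < 1 / (1 + np.exp(-logit_p))
).astype(int)

# Logistic regression: which factors are associated with dropping out?
model = smf.logit("dropped_out ~ is_paid + tenure_months",
                  data=contributors).fit()
print(model.summary())
```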

References
[1] Christopher Bogart, Christian Kästner, James Herbsleb, and Ferdian Thung. How to Break an API: Cost Negotiation and Community Values in Three Software Ecosystems. In Proc. Symposium on the Foundations of Software Engineering (FSE), 2016.

[2] Asher Trockman, Shurui Zhou, Christian Kästner, and Bogdan Vasilescu. Adding sparkle to social coding: an empirical study of repository badges in the npm ecosystem. In Proc. International Conference on Software Engineering (ICSE), 2018.

[3] Bogdan Vasilescu, Kelly Blincoe, Qi Xuan, Casey Casalnuovo, Daniela Damian, Premkumar Devanbu, and Vladimir Filkov. The sky is not the limit: multitasking across GitHub projects. In Proc. International Conference on Software Engineering (ICSE), 2016.

[4] Bogdan Vasilescu, Daryl Posnett, Baishakhi Ray, Mark GJ van den Brand, Alexander Serebrenik, Premkumar Devanbu, and Vladimir Filkov. Gender and tenure diversity in GitHub teams. In Proc. ACM Conference on Human Factors in Computing Systems (CHI), 2015.

Mentors: Joshua Sunshine and Brad Myers

Description and Significance
In the United States alone, software testing labor is estimated to cost at least $48 billion per year. Despite widespread automation in test execution and other areas of software engineering, test suites continue to be created manually by software engineers. Automatic Test sUite Generation (ATUG) tools have shown positive results in experiments without human subjects, but they are not widely adopted in industry.

Prior research provides clues as to why ATUG tools are not used in practice: they generate incorrect tests that the engineer must find and correct, they require engineers to acquire or synthesize knowledge that may be difficult to obtain in practice, and they produce test suites with poor readability, among other issues.
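
To illustrate the readability problem, compare a hypothetical machine-generated test, with an auto-numbered name and opaque magic values, against the kind of test an engineer would write by hand (the class under test and both tests are invented for illustration, not output from any particular tool):

```python
import unittest

class ShoppingCart:
    """Toy class under test (hypothetical example)."""
    def __init__(self):
        self.items = []
    def add(self, price, qty=1):
        self.items.append((price, qty))
    def total(self):
        return sum(price * qty for price, qty in self.items)

class GeneratedStyleTest(unittest.TestCase):
    # In the style typical of ATUG tools: meaningless name, arbitrary
    # inputs, and an assertion that merely pins down observed behavior.
    def test0(self):
        var0 = ShoppingCart()
        var0.add(757.44, 3)
        var0.add(0.01)
        self.assertAlmostEqual(2272.33, var0.total())

class HandWrittenTest(unittest.TestCase):
    # What an engineer might write instead: an intention-revealing name
    # and inputs chosen to communicate the behavior being checked.
    def test_total_multiplies_price_by_quantity(self):
        cart = ShoppingCart()
        cart.add(price=10.00, qty=3)
        self.assertAlmostEqual(cart.total(), 30.00)

if __name__ == "__main__":
    unittest.main()
```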

In this research initiative, we build upon prior work by viewing the problem through a human-theoretic lens that focuses on supporting the human software engineer’s task of generating a test suite. To that end, we apply a human-focused theory of ATUG tools and explore the theory using cutting-edge human-subjects research and prototype tools intended to address this important problem in software engineering.

Mentor: Daniel Klug

Description and Significance
Short-form video apps, most notably TikTok, are the newest and currently most popular social media platforms among younger people. The success of short-video apps is largely based on their high accessibility and ubiquity with regard to online social interaction and participation. But a key element of TikTok is the app’s distinctive yet mysterious algorithm, which curates individual video feeds for users based on their content consumption and browsing behavior. While initial studies are beginning to analyze the TikTok algorithm and some basic knowledge about it exists, we still have little understanding of what social media users know about the socio-technical aspects of short-video apps when they consume and create video content. In the MINT Lab, we are using qualitative approaches, such as interviews, content analysis, and user observations, to research users’ opinions, knowledge, and awareness of social media algorithms as part of using highly popular social media platforms for communication, socialization, and entertainment. Possible research questions are: What are common user understandings of the TikTok algorithm? In what ways do users observe how the algorithm might work? How does users’ understanding of algorithms affect their consumption and creation of video content? Such questions aim to better understand social, cultural, and political aspects of social media usage, especially in relation to community guidelines, privacy, ethics, race, gender, and marginalized communities. The goal is to study and understand how humans as users interact with social technology and how the use of social media apps is connected to and integrated into our everyday lives.

Student Involvement
Students will learn how to design qualitative research projects and how to apply qualitative methods to research socio-technological aspects of social media use and engagement. This can include designing and conducting interviews and user observations, finding and contacting study participants, best practices for conducting user studies, and how to transcribe, code, analyze, and interpret qualitative data (e.g., interviews, observation protocols). Based on quality criteria for qualitative research, students will learn how to develop and validate hypotheses from qualitative user data. The ideal student is familiar with social media platforms, has an interest in qualitative research, and is open to conducting interviews and/or observations.

References
- Klug, D., Qin, Y., Evans, M., & Kaufman, G. (2021, June). Trick and please: A mixed-method study on user assumptions about the TikTok algorithm. In 13th ACM Web Science Conference 2021 (pp. 84-92).

- Karizat, N., Delmonaco, D., Eslami, M., & Andalibi, N. (2021). Algorithmic folk theories and identity: How TikTok users co-produce knowledge of identity and engage in algorithmic resistance. Proceedings of the ACM on Human-Computer Interaction, 5(CSCW2), 1-44.

- Le Compte, D., & Klug, D. (2021, October). “It’s Viral!” A study of the behaviors, practices, and motivations of TikTok users and social activism. In Companion Publication of the 2021 Conference on Computer Supported Cooperative Work and Social Computing (pp. 108-111).

- Simpson, E., & Semaan, B. (2021). For You, or For “You”? Everyday LGBTQ+ encounters with TikTok. Proceedings of the ACM on Human-Computer Interaction, 4(CSCW3), 1-34.

Mentor: Bryan Parno

Description and Significance
Rust is a rapidly growing mainstream language (e.g., with users at Amazon, Google, Microsoft, and Mozilla, and in the Linux kernel) designed to produce "more correct" low-level systems code. Rust supports writing fast systems code with no runtime or garbage collector, while its powerful type system and ownership model guarantee memory and thread safety. This alone can rule out a large swath of common vulnerabilities. However, it does nothing to rule out higher-level vulnerabilities, such as SQL injection, incorrect cryptography usage, or logic errors.

Hence, we are developing a language and tool called Verus, which allows Rust developers to annotate their code with logical specifications of the code's behavior and automates the process of mathematically proving that the code meets those specifications. This means we can guarantee the code's correctness, reliability, and/or security at compile time.
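
To give a flavor of specification-annotated code, here is a sketch of the requires/ensures style, written in Python with runtime assertions purely for illustration (the function and its spec are hypothetical, and this is not Verus syntax). In Verus, analogous annotations are written directly in Rust and proved statically at compile time, so no check runs, or can fail, at runtime:

```python
def binary_search(xs, target):
    """Return an index i with xs[i] == target, or None if absent.

    Spec, in the style a verifier would check statically:
      requires: xs is sorted in ascending order
      ensures:  result == None implies target not in xs
                result == i    implies xs[i] == target
    """
    # "requires" clause, checked at runtime here; a verifier instead
    # proves it holds at every call site.
    assert all(xs[i] <= xs[i + 1] for i in range(len(xs) - 1))

    lo, hi = 0, len(xs)
    while lo < hi:
        # Loop invariant: target can only occur within xs[lo:hi].
        mid = (lo + hi) // 2
        if xs[mid] < target:
            lo = mid + 1
        elif xs[mid] > target:
            hi = mid
        else:
            assert xs[mid] == target  # "ensures" clause for the found case
            return mid

    assert target not in xs           # "ensures" clause for the absent case
    return None
```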

Student Involvement
In this project, students will learn more about software verification, write code in Verus and prove it correct, and potentially extend Verus itself with new features.