Interacting with large language and code models

Description: This project focuses on the application of machine learning (ML) to source code, for example Java, JavaScript or Python code. This area can be seen as an extension of the well-known field of Natural Language Processing (NLP), but with additional complexities (for example, the generated code must actually run on a computer).

There are different applications of ML on code, for example, translating code from one language to another (e.g., Java to Kotlin), code completion (used by IDEs such as Visual Studio), software maintenance tasks such as program repair (i.e., removing a bug from a program), or software testing.

The scientific and industrial communities have produced different code models trained on large bodies of data (generally open source projects available on GitHub).

More recently, the community has also focused on the use of large language models (LLMs), which are generally trained by large technology companies. For example, OpenAI offers the possibility of interacting with the Codex model, a large model trained on source code.

There are many important challenges in this area, such as how to interact with LLMs to get the best response (this is called "prompt/query engineering"), how to analyze the quality of the responses from these models (for example, whether the generated code is correct), and how to estimate the energy cost of training code models, among many others.
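
As a rough illustration of two of these challenges, the sketch below assembles a prompt and then checks whether a generated function is correct by running it against a few test cases. Everything in it is an assumption made for illustration: the prompt template, the helper names build_prompt and passes_tests, and the two hard-coded candidates that stand in for model responses. No real model or API is called.

```python
from typing import Any, Iterable, Tuple


def build_prompt(task: str, signature: str) -> str:
    """Assemble a simple prompt (hypothetical template, not a prescribed format)."""
    return (
        "Write a Python function that solves the task below.\n"
        f"Task: {task}\n"
        f"Signature: {signature}\n"
        "Return only the code."
    )


def passes_tests(code: str, func_name: str,
                 cases: Iterable[Tuple[tuple, Any]]) -> bool:
    """Return True if `code` defines `func_name` and it matches every expected output."""
    namespace: dict = {}
    try:
        # WARNING: untrusted model output should only be executed in a sandbox.
        exec(code, namespace)
        func = namespace[func_name]
        return all(func(*args) == expected for args, expected in cases)
    except Exception:
        # Syntax errors, missing functions and runtime failures all count as incorrect.
        return False


# Stand-ins for model responses (assumed for illustration).
correct_candidate = "def add(a, b):\n    return a + b\n"
buggy_candidate = "def add(a, b):\n    return a - b\n"
cases = [((1, 2), 3), ((0, 0), 0), ((-1, 1), 0)]

prompt = build_prompt("Add two integers.", "def add(a, b)")
print(passes_tests(correct_candidate, "add", cases))  # True
print(passes_tests(buggy_candidate, "add", cases))    # False
```

Passing a handful of tests is only a weak signal of correctness; a real study would likely need larger test suites and an isolated execution environment.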

Contact: Matias Martinez

email: matias.martinez@upc.edu