AI Weekly: Researchers attempt an open source alternative to GitHub's Copilot
In June, OpenAI teamed up with GitHub to launch Copilot, a service that suggests whole lines of code inside development environments like Microsoft Visual Studio. Powered by an AI model called Codex, which OpenAI later exposed through an API, Copilot can translate natural language into code across more than a dozen programming languages, interpreting commands in plain English and executing them.
Now, a community effort is underway to create an open source, freely available alternative to Copilot and OpenAI's Codex model. Dubbed GPT Code Clippy, its contributors hope to create an AI pair programmer that will allow researchers to study large AI models trained on code to better understand their capabilities and limitations.
Open source models
Codex is trained on billions of lines of public code and works with a broad set of frameworks and languages, adapting to the edits developers make to match their coding styles. Similarly, GPT Code Clippy learned from hundreds of millions of code examples to generate code similar to what a human programmer would write.
Contributors to the GPT Code Clippy project used GPT-Neo as the base of their AI models. Developed by the grassroots research collective EleutherAI, GPT-Neo is what's known as a Transformer model. This means that it weighs the influence of different parts of the input data rather than treating all of the input data the same. Transformers do not need to process the beginning of a sentence before the end. Instead, they identify the context that gives meaning to a word in the sentence, which allows them to process the input data in parallel.
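To make the idea of "weighing the influence of different parts of the input" concrete, here is a minimal sketch of scaled dot-product self-attention in plain Python. It is an illustration only, not GPT-Neo's implementation: the three-token input and its two-dimensional embeddings are made up, the learned projection matrices of a real Transformer are omitted, and so is the causal mask that GPT-style models apply so that each token only attends to earlier ones.

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def self_attention(vectors):
    """Scaled dot-product self-attention with queries = keys = values = vectors.

    Assumption: no learned projections and no causal mask (both exist in
    real GPT-style models and are left out to keep the sketch short).
    """
    scale = math.sqrt(len(vectors[0]))
    outputs, all_weights = [], []
    # Each position is computed independently of the others, which is why
    # the work can be done in parallel on real hardware.
    for query in vectors:
        weights = softmax([dot(query, key) / scale for key in vectors])
        mixed = [
            sum(w * v[i] for w, v in zip(weights, vectors))
            for i in range(len(query))
        ]
        outputs.append(mixed)
        all_weights.append(weights)
    return outputs, all_weights

# Toy, made-up embeddings for a three-token input.
tokens = ["def", "add", "return"]
embeddings = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
outputs, weights = self_attention(embeddings)
for token, row in zip(tokens, weights):
    print(token, [round(w, 2) for w in row])
```

Each printed row shows how much one token draws on every token in the input. Tokens with similar embeddings receive larger weights, which is the "context that gives meaning to a word" described above.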
GPT-Neo was pretrained on The Pile, an 825 GiB collection of 22 smaller datasets including academic sources (e.g., arXiv, PubMed), communities (StackExchange, Wikipedia), code repositories (GitHub), and more. Through fine-tuning, the GPT Code Clippy contributors enhanced its code understanding capabilities by exposing their models to repositories on GitHub that met certain search criteria (e.g., had more than 10 GitHub stars and two commits), with duplicate files filtered out.
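The selection step described above can be sketched roughly as follows. This is not the project's actual pipeline: the repository records and their field names are hypothetical, the thresholds are read as "more than 10 stars and at least two commits" (an assumption, since the wording is ambiguous), and exact-match hashing stands in for whatever duplicate detection the contributors used.

```python
import hashlib

# Hypothetical repository records; names and fields are made up for illustration.
repos = [
    {"name": "alice/parser", "stars": 42, "commits": 17,
     "files": {"parse.py": "def parse(s):\n    return s.split()\n"}},
    {"name": "bob/parser-fork", "stars": 15, "commits": 3,
     "files": {"parse.py": "def parse(s):\n    return s.split()\n"}},
    {"name": "carol/scratch", "stars": 2, "commits": 1,
     "files": {"tmp.py": "print('hi')\n"}},
]

def meets_criteria(repo, min_stars=10, min_commits=2):
    """Assumed reading of the criteria: stars > min_stars, commits >= min_commits."""
    return repo["stars"] > min_stars and repo["commits"] >= min_commits

def collect_training_files(repos):
    """Keep files from qualifying repos, skipping byte-identical duplicates."""
    seen_hashes = set()
    kept = []
    for repo in repos:
        if not meets_criteria(repo):
            continue
        for path, text in repo["files"].items():
            digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
            if digest in seen_hashes:
                continue  # same content already collected from another repo
            seen_hashes.add(digest)
            kept.append((repo["name"], path))
    return kept

kept = collect_training_files(repos)
print(kept)
```

In this toy data, the fork passes the star and commit thresholds but contributes nothing because its only file is an exact copy, while the third repository is dropped by the thresholds alone.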
“We used Hugging Face’s Transformers library … to fine-tune our model[s] on various code datasets, including one of our own, which we pulled from GitHub,” the contributors explain on the GPT Code Clippy project page. “We decided to fine-tune rather than train from scratch, since in OpenAI’s GPT-Codex paper they report that training from scratch and fine-tuning the model [result in equivalent] performance. However, fine-tuning allowed the model[s] to converge faster than training from scratch. Therefore, all of the versions of our models are fine-tuned.”
The GPT Code Clippy contributors have trained several models to date using third-generation tensor processing units (TPUs), Google’s custom AI accelerator chip available through Google Cloud. While the project is still in its early stages, they’ve created a plugin for Visual Studio and plan to extend the capabilities of GPT Code Clippy to other languages, especially under-represented ones.
“Our ultimate goal is not only to develop an open source version of GitHub’s Copilot, but one that offers comparable performance and ease of use,” the contributors wrote. “[We hope to eventually] devise ways to keep up with new versions of and updates to programming languages.”
Promise and setbacks
AI-powered coding models aren’t just valuable for writing code, but also for lower-hanging fruit like upgrading existing code. Migrating an existing codebase to a modern or more efficient language like Java or C++, for example, requires expertise in both the source and target languages, and it’s often costly. The Commonwealth Bank of Australia spent around $750 million over five years to convert its platform from COBOL to Java.
But there are many potential pitfalls, such as bias and undesirable code suggestions. In a recent paper, the Salesforce researchers behind CodeT5, a Codex-like system that can understand and generate code, acknowledge that the datasets used to train CodeT5 could encode stereotypes such as those around race and gender, drawn from text comments or even from the source code itself. Moreover, they say, CodeT5 could contain sensitive information like personal addresses and identification numbers. And it might produce vulnerable code that negatively affects software.
OpenAI similarly found that Codex can suggest compromised packages, invoke functions insecurely, and produce programming solutions that appear correct but don’t actually perform the intended task. The model can also be prompted to generate racist and harmful outputs as code, such as the words “terrorist” and “violent” when writing code comments with the prompt “Islam.”
The GPT Code Clippy team hasn’t said how it might mitigate bias that could be present in its open source models, but the challenges are clear. While the models could, for example, eventually cut down on Q&A sessions and repetitive code review feedback, they could cause harm if not carefully audited, particularly in light of research showing that coding models fall far short of human accuracy.
For AI coverage, send news tips to Kyle Wiggers, and be sure to subscribe to the AI Weekly newsletter and bookmark our AI channel, The Machine.
Thanks for reading,
AI Staff Writer