DeepMind claims its new code-generating system can compete with human programmers


Last year, San Francisco-based research lab OpenAI released Codex, an AI model for translating natural language commands into app code. The model, which powers GitHub’s Copilot feature, was heralded at the time as one of the most powerful examples of machine programming, the category of tools that automates the development and maintenance of software.

Not to be outdone, DeepMind, the AI lab backed by Google parent company Alphabet, claims to have improved upon Codex in key areas with AlphaCode, a system that can write competition-level code. In programming contests hosted on Codeforces, a platform for such competitions, DeepMind claims that AlphaCode achieved an average ranking within the top 54.3% across 10 recent contests with more than 5,000 participants each.

DeepMind lead researcher Oriol Vinyals says it’s the first time a computer system has reached such a competitive level in programming competitions. “AlphaCode [can] read the natural language descriptions of an algorithmic problem and produce code that not only compiles, but is correct,” he added in a statement. “[It] indicates that there is still work to be done to reach the highest performing level and improve the problem-solving capacity of our AI systems. We hope this benchmark will lead to further innovations in troubleshooting and code generation.”

Learning to code with AI

Machine programming has been supercharged by AI in recent months. At the Build developer conference in May 2021, Microsoft outlined a new feature in Power Apps that uses OpenAI’s GPT-3 language model to help people choose formulas. Intel’s ControlFlag can autonomously detect errors in code. And Facebook’s TransCoder converts code from one programming language to another.

The range of applications is huge – which explains why there is such a rush to create such systems. According to a study by the University of Cambridge, at least half of developers’ efforts are spent on debugging, costing the software industry an estimated $312 billion a year. AI-powered code suggestion and review tools promise to cut development costs while allowing programmers to focus on creative, less repetitive tasks — assuming the systems work as advertised.

Like Codex, AlphaCode (the largest version of which contains 41.4 billion parameters, about four times the size of Codex) was trained on a snapshot of public repositories on GitHub in the programming languages C++, C#, Go, Java, JavaScript, Lua, PHP, Python, Ruby, Rust, Scala and TypeScript. AlphaCode’s training dataset was 715.1 GB, about the same size as Codex’s, which OpenAI estimated to be “more than 600 GB”.

An example of the interface AlphaCode used to answer programming challenges.

In machine learning, parameters are the part of the model learned from historical training data. Generally speaking, the correlation between the number of parameters and a model’s sophistication has held up remarkably well.

Architecturally speaking, AlphaCode is a Transformer-based language model, similar to Salesforce’s code-generating CodeT5. The Transformer architecture consists of two core components: an encoder and a decoder. The encoder contains layers that process input data, such as text and images, iteratively, layer by layer. Each encoder layer generates encodings with information about which parts of the input are relevant to each other, then passes these encodings on to the next layer until the final encoder layer is reached.
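As a rough illustration of how an encoder layer can relate parts of the input to one another, the sketch below implements a simplified self-attention step in plain Python and stacks it into layers. It is a toy, not AlphaCode’s implementation: real encoders use learned projection matrices, multiple attention heads, and feed-forward sublayers. Here the queries, keys, and values are all assumed to be the input vectors themselves, and the function names are invented for this example.

```python
import math


def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(vectors):
    """One simplified self-attention step over equal-length vectors.

    Assumption: queries, keys, and values are the input vectors
    themselves (no learned projections), to keep the toy self-contained.
    """
    dim = len(vectors[0])
    encodings, all_weights = [], []
    for query in vectors:
        # Score how relevant every position is to the current one.
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            for key in vectors
        ]
        weights = softmax(scores)
        # The new encoding is a relevance-weighted mix of all positions.
        mixed = [
            sum(w * value[i] for w, value in zip(weights, vectors))
            for i in range(dim)
        ]
        encodings.append(mixed)
        all_weights.append(weights)
    return encodings, all_weights


def encode(vectors, num_layers=2):
    """Stack layers: each layer's encodings feed the next layer."""
    for _ in range(num_layers):
        vectors, _ = self_attention(vectors)
    return vectors


# Three toy token vectors; the first two point in similar directions.
tokens = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
encodings, weights = self_attention(tokens)
final_encodings = encode(tokens)
```

In this toy input, the first token assigns more weight to the second token than to the third, because their vectors are more alike. That is the sense in which each layer records which parts of the input are relevant to each other.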

Creating a new benchmark

Transformers typically undergo semi-supervised learning, which involves unsupervised pre-training followed by supervised fine-tuning. Sitting between supervised and unsupervised learning, semi-supervised learning accepts data that is partially labeled or in which most examples have no labels. In this case, Transformers are first exposed to “unknown” data for which no previously defined labels exist. During the fine-tuning process, Transformers train on labeled datasets so that they learn to perform particular tasks, such as answering questions, analyzing sentiment, and paraphrasing documents.

In the case of AlphaCode, DeepMind fine-tuned and tested the system on CodeContests, a new dataset the lab created with problems, solutions, and test cases scraped from Codeforces, with public programming datasets mixed in. DeepMind also tested the highest-performing version of AlphaCode (an ensemble of the 41-billion-parameter model and a 9-billion-parameter model) in actual programming contests on Codeforces, running AlphaCode live to generate solutions to each problem.

On CodeContests, given up to a million samples per problem, AlphaCode solved 34.2% of the problems. And on Codeforces, DeepMind claims that its overall performance placed it within the top 28% of users who had entered a contest in the past six months.
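Evaluations of this kind draw many candidate programs per problem and check them against test cases. The sketch below mimics that loop under stated assumptions: candidates are first filtered on the visible example tests, and a handful of survivors are then judged on hidden tests. Everything in it is a hypothetical stand-in: the toy problem, `candidate_pool`, `sample_candidates`, and the cap of 10 submissions are invented for illustration, and a random draw from a fixed pool takes the place of a real code model. It does not reproduce DeepMind’s pipeline.

```python
import random

# Hypothetical toy problem: "return the sum of a list of integers".
EXAMPLE_TESTS = [([1, 2, 3], 6), ([], 0)]             # visible in the statement
HIDDEN_TESTS = [([5], 5), ([-1, 1], 0), ([2, 2], 4)]  # used for judging

# Stand-in for a code model: a fixed pool of candidate solutions,
# most of them wrong. A real system would generate source code instead.
candidate_pool = [
    lambda xs: 0,                   # ignores the input
    lambda xs: len(xs),             # counts instead of summing
    lambda xs: max(xs, default=0),  # wrong, but passes some tests
    lambda xs: sum(xs),             # correct
]


def sample_candidates(num_samples, rng):
    """Draw candidates at random, imitating repeated sampling from a model."""
    return [rng.choice(candidate_pool) for _ in range(num_samples)]


def passes(candidate, tests):
    """True if the candidate returns the expected output on every test."""
    try:
        return all(candidate(list(inp)) == expected for inp, expected in tests)
    except Exception:
        return False  # a crash counts as a failure


def solve(num_samples, max_submissions=10, seed=0):
    """Filter samples on the example tests, then judge a few survivors."""
    rng = random.Random(seed)
    samples = sample_candidates(num_samples, rng)
    survivors = [c for c in samples if passes(c, EXAMPLE_TESTS)]
    submissions = survivors[:max_submissions]  # assumed submission cap
    return any(passes(c, HIDDEN_TESTS) for c in submissions)
```

With no samples, `solve(0)` returns `False`; with a larger budget, the chance that at least one draw is the correct candidate rises quickly. This illustrates why the number of samples allowed per problem matters when comparing solve rates.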

“The latest DeepMind paper is another impressive piece of engineering showing that impressive benefits can still be gained from our current Transformer-based models with ‘just’ the right sampling and training adjustments and without fundamental changes to the model architecture,” Connor Leahy, a member of the open AI research effort EleutherAI, told VentureBeat via email. “DeepMind brings out the full toolbox of tweaks and best practices by using clean data, large models, a whole host of smart training tricks and of course a lot of computing power. DeepMind has pushed the performance of these models much faster than even I expected. The 50th percentile competitive programming result is a huge leap, and their analysis clearly shows that this is not ‘just rote learning’. The progress in coding models from GPT3 to codex to AlphaCode is really astonishingly fast.”

Limitations of code generation

Machine programming is by no means a solved science, and DeepMind admits that AlphaCode has limitations. For example, the system does not always produce syntactically correct code in every language, particularly C++. AlphaCode also performs worse when generating challenging code, such as that required for dynamic programming, a technique that solves a complex problem by breaking it into smaller, overlapping subproblems.
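For readers unfamiliar with the term, dynamic programming builds the answer to a problem from stored answers to smaller versions of it, so that each subproblem is computed only once. The sketch below applies the idea to a classic exercise chosen purely for illustration (the fewest coins needed to reach a target amount). It is not drawn from AlphaCode’s evaluation set.

```python
def min_coins(coins, amount):
    """Fewest coins from `coins` that sum to `amount`, or -1 if impossible."""
    INF = float("inf")
    # best[a] holds the fewest coins needed to make amount a.
    best = [0] + [INF] * amount
    for a in range(1, amount + 1):
        for coin in coins:
            if coin <= a and best[a - coin] + 1 < best[a]:
                best[a] = best[a - coin] + 1  # reuse a smaller subproblem
    return best[amount] if best[amount] != INF else -1


print(min_coins([1, 3, 4], 6))  # 2 (3 + 3); greedily taking 4 first needs 3 coins
print(min_coins([4, 7], 5))     # -1: no combination adds up to 5
```

The first call shows why the table is needed: always grabbing the largest coin gives a worse answer, so the solver has to weigh every smaller subproblem rather than follow a fixed rule.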

AlphaCode can be problematic in other ways as well. While DeepMind did not probe the model for bias, code-generating models, including Codex, have been shown to amplify toxic and flawed content in their training datasets. For example, Codex can be prompted to write “terrorist” when fed the word “Islam,” and it can generate code that appears superficially correct but poses a security risk by calling compromised software and using insecure configurations.

Systems like AlphaCode, which, it should be noted, are expensive to produce and maintain, can also be abused, as recent research has shown. Researchers at Booz Allen Hamilton and EleutherAI trained a language model called GPT-J to generate code that could solve introductory computer science exercises, successfully bypassing a widely used programming plagiarism detection tool. At the University of Maryland, researchers found that current language models can generate false cybersecurity reports convincing enough to fool leading experts.

It is an open question whether malicious actors will use these types of systems in the future to automate the creation of malware at scale. That’s why Mike Cook, an AI researcher at Queen Mary University of London, challenges the idea that AlphaCode is bringing the industry closer to “a problem-solving AI.”

“I don’t think this result is too surprising, as text comprehension and code generation are two of the four major tasks in which AI has shown improvements in recent years… A challenge with this domain is that outputs are quite prone to errors. One wrong word or pixel or musical note in an AI-generated story, artwork, or melody might not ruin the whole thing for us, but a single missed test case in a program could bring down space shuttles and destroy economies,” Cook told VentureBeat via email. “So while the idea of bringing the power of programming to people who can’t program is exciting, we have a lot of problems to solve before we get there.”

If DeepMind can solve these problems, and that’s a big if, it stands to make a comfortable profit in a constantly growing market. Of the practical domains the lab has recently tackled with AI, such as weather forecasting, material modeling, atomic energy calculation, app recommendations, and data center cooling optimization, programming is one of the most lucrative. Even migrating an existing codebase to a more efficient language like Java or C++ costs a hefty sum. For example, the Commonwealth Bank of Australia spent approximately $750 million over the course of five years to convert its platform from COBOL to Java.

“I can safely say that AlphaCode’s results exceeded my expectations. I was skeptical because even simple competition problems often require not only implementing the algorithm, but also (and this is the hardest part) figuring it out,” Codeforces founder Mike Mirzayanov said in a statement. “AlphaCode has managed to perform at the level of a promising new competitor. I can’t wait to see what’s in store for us.”

