CodeGen is an open-source model for program synthesis. Trained on TPU-v4. Competitive with OpenAI Codex.
BSD 3-Clause "New" or "Revised" License


Official release for the CodeGen models (350M, 2B, 6B, 16B) for Program Synthesis. That is, the model translates English into executable code as presented in the paper:

Title: A Conversational Paradigm for Program Synthesis

Authors: Erik Nijkamp*, Bo Pang*, Hiroaki Hayashi*, Lifu Tu, Huan Wang, Yingbo Zhou, Silvio Savarese, and Caiming Xiong (* indicates equal contribution)

The current version releases the sampling code, while the detailed training code will be released soon.


The model is available on the HuggingFace Hub with a Colab demo here.

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
tokenizer = AutoTokenizer.from_pretrained("Salesforce/codegen-2B-mono")
model = AutoModelForCausalLM.from_pretrained("Salesforce/codegen-2B-mono")
inputs = tokenizer("# this function prints hello world", return_tensors="pt").to(0)
sample = model.generate(**inputs, max_length=128)
print(tokenizer.decode(sample[0], truncate_before_pattern=[r"\n\n^#", "^'''", "\n\n\n"]))


This Google Colab notebook allows for sampling from the CodeGen models.


git clone
cd CodeGen

# download the model parameters
# codegen-350M-nl,multi,mono
# wget -P checkpoints && tar -xvf checkpoints/codegen-350M-nl.tar.gz -C checkpoints/
# wget -P checkpoints && tar -xvf checkpoints/codegen-350M-multi.tar.gz -C checkpoints/
wget -P checkpoints && tar -xvf checkpoints/codegen-350M-mono.tar.gz -C checkpoints/
# codegen-2B-nl,multi,mono
# wget -P checkpoints && tar -xvf checkpoints/codegen-2B-nl.tar.gz -C checkpoints/
# wget -P checkpoints && tar -xvf checkpoints/codegen-2B-multi.tar.gz -C checkpoints/
# wget -P checkpoints && tar -xvf checkpoints/codegen-2B-mono.tar.gz -C checkpoints/
# codegen-6B-nl,multi,mono
# wget -P checkpoints && tar -xvf checkpoints/codegen-6B-nl.tar.gz -C checkpoints/
# wget -P checkpoints && tar -xvf checkpoints/codegen-6B-multi.tar.gz -C checkpoints/
# wget -P checkpoints && tar -xvf checkpoints/codegen-6B-mono.tar.gz -C checkpoints/
# codegen-16B-nl,multi,mono
# wget -P checkpoints && tar -xvf checkpoints/codegen-16B-nl.tar.gz -C checkpoints/
# wget -P checkpoints && tar -xvf checkpoints/codegen-16B-multi.tar.gz -C checkpoints/
# wget -P checkpoints && tar -xvf checkpoints/codegen-16B-mono.tar.gz -C checkpoints/

# create a virtual environment with requirements
python3.8 -m venv .venv
source .venv/bin/activate
pip3 install --upgrade pip setuptools
pip3 install -r requirements.txt

# sample from the model with an arbitrary context
python3 -m jaxformer.hf.sample --model codegen-350M-mono --context "def hello_world():"

Released Models

We release models of various sizes trained on various datasets. The models are named in the following format:


model-size has 4 options: 350M, 2B, 6B, 16B, which represent the number of parameters in each model.

data has 3 options: nl, multi, mono.

  • nl models are randomly initialized and trained on The Pile, a 825.18 GB English text corpus.
  • multi models are initialized from nl models and then trained on a corpus with code data consisting of multiple programming languages.
  • mono models are initialized from multi models and then trained on a corpus with Python code data.

The model names can be provided to the --model flag for See a sample usage above in Setup.


If you find our code or paper useful, please cite the paper:

  title={A Conversational Paradigm for Program Synthesis},
  author={Nijkamp, Erik and Pang, Bo and Hayashi, Hiroaki and Tu, Lifu and Wang, Huan and Zhou, Yingbo and Savarese, Silvio and Xiong, Caiming},
  journal={arXiv preprint},


Our code is BSD-3 licensed. See LICENSE.txt for details.