June 2, 2024

How to Run Llama3 70B on a Single 4GB GPU Locally

Fahd Mirza

This video shows how to run the full Llama 3 70B model locally on just 4 GB of GPU VRAM using the AirLLM library, which loads the model one transformer layer at a time so the entire 70B checkpoint never has to sit in GPU memory at once. AirLLM is available on PyPI (pip install airllm), and the example below loads the model, tokenizes a prompt, and generates a short answer.




from airllm import AutoModel

MAX_LENGTH = 128

# Load Llama 3 70B; AirLLM streams the weights layer by layer,
# so only one transformer block needs to be in VRAM at a time.
model = AutoModel.from_pretrained("v2ray/Llama-3-70B")

input_text = [
    'What is the capital of Australia?'
]

# Tokenize the prompt, truncating it to at most MAX_LENGTH tokens.
input_tokens = model.tokenizer(input_text,
    return_tensors="pt",
    return_attention_mask=False,
    truncation=True,
    max_length=MAX_LENGTH,
    padding=False)

# Generate up to 20 new tokens on the GPU, reusing the KV cache between steps.
generation_output = model.generate(
    input_tokens['input_ids'].cuda(),
    max_new_tokens=20,
    use_cache=True,
    return_dict_in_generate=True)

# Decode the returned sequence (prompt plus generated tokens) and print it.
output = model.tokenizer.decode(generation_output.sequences[0])
print(output)
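
If loading the full-precision weights from disk proves too slow or memory is still tight, AirLLM's documentation also describes an optional compression argument on AutoModel.from_pretrained for block-wise quantization. The variant below is a minimal sketch assuming that option ('4bit', backed by the bitsandbytes package) is available in your installed AirLLM version; everything else mirrors the example above.

from airllm import AutoModel

MAX_LENGTH = 128

# Assumption: the installed AirLLM version accepts compression='4bit'
# (block-wise quantization; requires the bitsandbytes package).
model = AutoModel.from_pretrained("v2ray/Llama-3-70B", compression='4bit')

input_tokens = model.tokenizer(
    ['What is the capital of Australia?'],
    return_tensors="pt",
    return_attention_mask=False,
    truncation=True,
    max_length=MAX_LENGTH,
    padding=False)

generation_output = model.generate(
    input_tokens['input_ids'].cuda(),
    max_new_tokens=20,
    use_cache=True,
    return_dict_in_generate=True)

print(model.tokenizer.decode(generation_output.sequences[0]))

The main benefit of quantization here is faster per-layer loading and a smaller memory footprint, at some potential cost in output quality.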