Using LLaMA.cpp models in Ruby

Captain's log, stardate d398.y40/AB

In this blog post, we will show you how to use LLaMA (Meta's AI) models from Ruby in your applications and projects.

Llama using a mac laptop

LLaMA is Meta's AI model that was “accidentally” leaked via torrent and is now available to everyone. llama.cpp is a project that provides a plain C/C++ implementation of LLaMA inference, with optional 4-bit quantization support for faster, lower-memory inference, and it is optimized for desktop CPUs. Many cool projects have been built on llama.cpp, such as GPT4All.
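To get a feel for why 4-bit quantization matters on a laptop, here is a back-of-the-envelope estimate of how much space a model's weights take (parameters × bits per weight), using a 7B-parameter model such as Vicuna-7B. It is a simplification: quantized formats such as q4_0 also store scale factors for each block of weights, and inference needs extra memory for the context, so real numbers come out somewhat higher.

```ruby
# Rough size of a model's weights in gigabytes: parameters x bits per weight.
# Simplification: ignores the per-block scale factors stored by quantized
# formats and the extra memory needed at inference time.
def weights_gb(n_params, bits_per_weight)
  n_params * bits_per_weight / 8.0 / 1e9
end

puts weights_gb(7e9, 16) # 16-bit weights of a 7B model: prints 14.0
puts weights_gb(7e9, 4)  # same model with 4-bit quantization: prints 3.5
```

Going from 16 to 4 bits per weight cuts the estimate by a factor of four, which is what makes a 7B model fit comfortably in a typical laptop's RAM.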

A C/C++ library can be wrapped with bindings for other languages, and luckily, Ruby bindings already exist: yoshoku/llama_cpp.rb! This means we can now use LLaMA-based models like Alpaca or Vicuna directly from Ruby (and from Node.js too, using hlhr202/llama-node).

The instructions for running it were a little cryptic, and I couldn’t find a straightforward step-by-step tutorial. But the process is quite simple once you know the steps.

First of all, add the llama_cpp gem (the llama_cpp.rb bindings) to your project:

$ bundle add llama_cpp
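If you want to confirm that the gem is actually available before going further, a quick RubyGems lookup is enough. This is a minimal sketch: `gem_installed?` is a hypothetical helper, and it only sees the gems visible to the Ruby process that runs it.

```ruby
# Hypothetical helper: true if RubyGems can find at least one installed
# version of the gem with the given name.
def gem_installed?(name)
  Gem::Specification.find_all_by_name(name).any?
end

if gem_installed?("llama_cpp")
  puts "llama_cpp is installed"
else
  puts "llama_cpp is missing; run `bundle add llama_cpp`"
end
```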

Then you need to download a model; the easiest way is to get it from Hugging Face. Hugging Face model repositories are ordinary Git repositories, but the model files are very large (several GB), so they are stored with Git Large File Storage (LFS). Install Git LFS before cloning; otherwise you will only get small pointer files instead of the actual weights.

To download ggml-vicuna-7b-4bit, for example, you can run:

$ git lfs install
$ git clone https://huggingface.co/chharlesonfire/ggml-vicuna-7b-4bit ./models/ggml-vicuna-7b-4bit
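A common pitfall is ending up with a Git LFS pointer instead of the real model file, which only fails later when the model is loaded. Every LFS pointer file starts with the line `version https://git-lfs.github.com/spec/v1`, so checking the first bytes is enough to tell the two apart. The sketch below assumes the model path used in the script later in this post; `lfs_pointer?` is a hypothetical helper.

```ruby
# Git LFS stores large files as small text "pointer" files until their
# content is downloaded. Every pointer file starts with this line.
LFS_POINTER_PREFIX = "version https://git-lfs.github.com/spec/v1".b

# Hypothetical helper: true if the file at `path` is an LFS pointer
# rather than the actual model weights.
def lfs_pointer?(path)
  # Reading only the first few bytes is enough; no need to load gigabytes.
  head = File.open(path, "rb") { |f| f.read(LFS_POINTER_PREFIX.bytesize) }
  head == LFS_POINTER_PREFIX
end

model_path = "./models/ggml-vicuna-7b-4bit/ggml-vicuna-7b-q4_0.bin"

if !File.exist?(model_path)
  warn "#{model_path} not found"
elsif lfs_pointer?(model_path)
  warn "#{model_path} is only an LFS pointer; run `git lfs pull` in the model repo"
else
  size_gib = (File.size(model_path) / 1024.0**3).round(2)
  puts "#{model_path} looks fine (#{size_gib} GiB)"
end
```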

The model is now ready to be used from your code. Here is an experiment I’m working on to implement it in Jambots:

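#!/usr/bin/env ruby

# Set up the load path from the Gemfile before requiring llama_cpp.
require "bundler/setup"
require "llama_cpp"

model_path = "./models/ggml-vicuna-7b-4bit/ggml-vicuna-7b-q4_0.bin"

# Few-shot transcript that the model continues (<<~ strips the indentation).
prompt = <<~HEREDOC
    Transcript of a dialog, where the User interacts with an Assistant named Bob. Bob is helpful, kind, honest, good at writing, and never fails to answer the User’s requests immediately and with precision.
    User: Hello, Bob.
    Bob: Hello. How may I help you today?
    User: Please tell me the largest city in Europe.
    Bob: Sure. The largest city in Europe is Moscow, the capital of Russia.
HEREDOC

# The user's message comes from the first command-line argument.
abort "Message required" if ARGV.empty?

message = ARGV[0]

# Assumes the LLaMACpp::Client class from early (0.0.x) llama_cpp.rb releases;
# later versions replaced it, so check the README of the version you install.
client = LLaMACpp::Client.new(model_path: model_path, n_threads: 4, seed: 12)
output = client.completions("#{prompt} #{message}")

puts output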
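The transcript prompt works because the model simply continues the text it is given. For multi-turn chats, a sketch like the one below can help: it ends the prompt with `Bob:` so the model answers as Bob, and trims the output where the model starts writing the User's next line. `build_prompt` and `extract_reply` are hypothetical helpers, and they assume the completion you get back contains only newly generated text, not the prompt itself.

```ruby
# Hypothetical helper: append the user's turn and leave "Bob:" open
# so the model continues the transcript as Bob.
def build_prompt(transcript, message)
  "#{transcript.rstrip}\nUser: #{message}\nBob:"
end

# Hypothetical helper: keep only the text before the model starts a new
# "User:" line, i.e. Bob's answer for this turn.
def extract_reply(completion)
  completion.split(/^User:/, 2).first.to_s.strip
end

transcript = <<~TEXT
  Transcript of a dialog, where the User interacts with an Assistant named Bob.
  User: Hello, Bob.
  Bob: Hello. How may I help you today?
TEXT

puts build_prompt(transcript, "What is Ruby?")

# A sample completion string, trimmed down to Bob's answer:
puts extract_reply(" Ruby is a programming language.\nUser: Thanks!\nBob: You're welcome.")
# prints "Ruby is a programming language."
```

Keeping the transcript in a variable and appending each exchange to it is one way to give the model the conversation history on every call.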

It runs a bit slowly on my computer, and llama.cpp's many parameters take some getting used to, but at least we can now use this ecosystem from Ruby and Node.js.
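Since speed depends a lot on the thread count, one way to tune it is to time the same completion with a few values of `n_threads`. The helpers below are a hypothetical sketch: `thread_candidates` lists powers of two up to the CPU count reported by `Etc.nprocessors` (logical CPUs, which can be more than the physical cores), and `time_it` measures a block with a monotonic clock.

```ruby
require "etc"

# Hypothetical helper: thread counts worth benchmarking, i.e. powers of two
# below the number of CPUs Ruby can see, plus that number itself.
def thread_candidates(cpus = Etc.nprocessors)
  counts = []
  n = 1
  while n < cpus
    counts << n
    n *= 2
  end
  counts << cpus
end

# Hypothetical helper: seconds spent running the given block, measured
# with a monotonic clock so wall-clock adjustments do not skew it.
def time_it
  start = Process.clock_gettime(Process::CLOCK_MONOTONIC)
  yield
  Process.clock_gettime(Process::CLOCK_MONOTONIC) - start
end

puts "CPUs visible to Ruby: #{Etc.nprocessors}"
puts "n_threads values to try: #{thread_candidates.join(', ')}"
puts "Sample timing: #{time_it { 100_000.times.sum }.round(4)} s"
```

For example, you could run one completion from the script above inside `time_it` for each value in `thread_candidates` and keep the fastest setting for your machine.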

Enjoy tinkering with it!


Juan Artero


Separated at birth from his brother Javier, he has now reunited with his frontend counterpart. If they should ever decide to merge in real life, they would become the ultimate full-stack developer every company wants to hire.
