Why and how to deploy large language models with Ollama


Normally we can just use OpenAI, 文言, or another hosted service.
But sometimes data security requirements mean the data cannot be sent to OpenAI and the like.
In those cases you need to host your own model.

Some good models to consider are Mistral, Mixtral, and Llama. From there, pick the model that fits your needs.

Llama, from Meta, is a large language model that emphasises performance and safety. Its latest version, Llama 3, is expected to have over 140 billion parameters, which suggests great potential for handling complex tasks and large datasets. Llama 3 is designed to improve the model's comprehension and response accuracy while remaining careful when answering sensitive or controversial questions.

Gemma, from Google, is known for its open nature and flexibility. It is offered in two sizes, aimed at everything from on-device deployment to high-performance computing. Gemma's lightweight design lets it adapt quickly to a wide range of natural language processing tasks.

Mistral, while less publicised, is considered a strong competitor alongside Llama and Gemma. Its strengths may lie in its particular algorithms and application areas, bringing a fresh perspective and new solutions to AI development.

Llama 3, Gemma, and Mistral are all important models in the AI space, each with its own approach to design philosophy, model size, technical implementation, and open-source strategy.


Take Mistral, for example. It is about 4GB in size and outperforms GPT-3.5 on most metrics. For its size, Mistral is the best model in my opinion.

Probably the best open-source model currently available is Mixtral, which outperforms Mistral, but it is a huge model and requires at least 48GB of RAM to run.

The Mistral model, which we will be running locally, is a 7 billion parameter model. Mixtral is a much larger mixture-of-experts model, with roughly 47 billion parameters in total (eight experts of 7 billion parameters each).

It works this way: all of these LLMs are neural networks. A neural network is a collection of neurons, and each neuron connects to all of the neurons in the following layer.

Basic Neural Network

Each connection has a weight, which is usually a percentage. Each neuron also has a bias, which modifies the data as it passes through that node.

The whole purpose of a neural network is to "learn" a very advanced algorithm, effectively a pattern-matching algorithm. In the case of LLMs, by being trained on huge amounts of text, the network learns to predict text patterns and so can generate meaningful responses to our prompts.

In simple terms the parameters are the number of weights and biases in the model. This tends to give us an idea of how many neurons are in the neural network. For a 7 billion parameter model there will be something on the order of 100 layers, with thousands of neurons per layer.
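To make the counting concrete, here is a quick Ruby sketch that tallies the weights and biases of a small fully connected network; the layer widths are made up purely for illustration:

```ruby
# Each pair of adjacent layers contributes (inputs * outputs) weights,
# plus one bias per output neuron.
layer_sizes = [512, 1024, 1024, 256] # hypothetical layer widths

params = layer_sizes.each_cons(2).sum do |inputs, outputs|
  inputs * outputs + outputs
end

puts params # total weights + biases, i.e. the parameter count
```

Scale those widths up by a few orders of magnitude and you arrive at the billions of parameters quoted for models like Mistral.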

The general idea is that the more parameters a model has, the more accurate it is, though this doesn't always play out.

To put this in context, GPT-3.5 has about 175 billion parameters. It's actually quite amazing that Mistral, with 7 billion parameters, can outperform GPT-3.5 on many metrics.

Software to run models locally

To run open-source models locally you need software that can host them. While there are several options on the market, the simplest I found, and the one that runs on an Intel Mac, is Ollama.

Right now Ollama runs on Mac and Linux, with Windows support coming in the future. In the meantime you can use WSL on Windows to run a Linux shell.

Ollama allows you to download and run these open source models. It also opens up the model on a local port giving you the ability to make API calls via your Ruby code. And this is where it gets fun as a Ruby developer. You can write Ruby apps that integrate with your own local models.


Setting Up Ollama

Installation of Ollama is straightforward. Just download the software and it will install the package. Ollama is primarily command-line based, making it easy to install and run your models. Just follow the download steps and you will be set up in about 5 minutes.


Ollama home page
Installing your first model

Once you have Ollama set up and running, you should see the Ollama icon in your task bar. This means it’s running in the background and will be able to run your models.


The next step is to download the open-source model you plan to use. In our case we will download Mistral.

  • Open your terminal
  • Run the following command:
ollama run mistral
The first time this command is executed it will download Mistral, which will take some time as it is about 4GB in size.

Once it has finished downloading it will open the Ollama prompt and you can start communicating with Mistral.


Running Mistral in terminal
Next time you run `ollama run mistral` it will just run the model.

Customising Models

With Ollama you can create customisations to the base model. This is a little like creating custom GPTs in OpenAI.

Full details are provided in the Ollama documentation.

The steps to create a custom model are fairly simple:

  • Create a `Modelfile`
  • Add the following text to the Modelfile:
FROM mistral

# Set the temperature set the randomness or creativity of the response
PARAMETER temperature 0.3

# Set the system message
SYSTEM """
You are an expert Ruby developer. You will be asked questions about the Ruby 
Programming language. You will provide an explanation along with code examples.
"""
You can change the temperature to suit your needs. 1 is highly creative, or more precisely 'random', while 0 is the most precise.

The system message is what primes the AI model to respond in a given way.
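Temperature and the system message are not the only knobs. Other settings can be added with additional `PARAMETER` lines; for example, `num_ctx` is a documented Modelfile parameter that controls the context window size (the values below are just illustrative):

```
FROM mistral

# temperature: randomness/creativity of the responses
PARAMETER temperature 0.3

# num_ctx: size of the context window in tokens
PARAMETER num_ctx 4096
```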

These steps will create the new model:

  • To create the new model run the following command in the terminal:
ollama create <model-name> -f './Modelfile'
  • In my case, I am naming the model ‘ruby’.
ollama create ruby -f './Modelfile'
This will create the new model called ruby.

  • List your models with the following command:
ollama list

  • Now you can run the custom model
ollama run ruby
Integrating with Ruby

Although there's no dedicated gem for Ollama yet (though I am working on it), Ruby developers can interact with the model using basic HTTP requests. Ollama runs in the background and serves the model on port `11434`, so you can access it via `http://localhost:11434`.

The Ollama API documentation provides the different endpoints for the basic commands such as chat and creating embeddings.
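As a quick aside, here is a sketch of building (though not sending) a request to the embeddings endpoint. The payload shape follows the Ollama API documentation, and the model name assumes you have already pulled Mistral; sending it requires Ollama running locally, so here we only construct the request and inspect its payload:

```ruby
require 'net/http'
require 'uri'
require 'json'

# Build a POST request for the /api/embeddings endpoint.
uri = URI('http://localhost:11434/api/embeddings')

request = Net::HTTP::Post.new(uri)
request.content_type = 'application/json'
request.body = JSON.dump({
  model: 'mistral',
  prompt: 'Ruby is a dynamic, object-oriented language.'
})

# To actually send it (with the Ollama server running):
#   response = Net::HTTP.start(uri.hostname, uri.port) { |http| http.request(request) }
#   JSON.parse(response.body)['embedding']  # an array of floats

puts JSON.parse(request.body)['model']
```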

For us we want to work with the `/api/chat` endpoint to send a prompt to the AI model.

Here is some basic Ruby code for interacting with the model.

require 'net/http'
require 'uri'
require 'json'

uri = URI('http://localhost:11434/api/chat')

request = Net::HTTP::Post.new(uri)
request.content_type = 'application/json'
request.body = JSON.dump({
  model: 'ruby',
  messages: [{
    role: 'user',
    content: 'How can I convert a PDF into text?',
  }],
  stream: false
})

response = Net::HTTP.start(uri.hostname, uri.port) do |http|
  http.read_timeout = 120
  http.request(request)
end

puts response.body
The Ruby code does the following:

  • The code starts by requiring three libraries: ‘net/http’, ‘uri’, and ‘json’. These libraries are used for making HTTP requests, parsing URIs, and handling JSON data respectively.
  • A URI object is created with the address of the API endpoint (‘http://localhost:11434/api/chat').
  • A new HTTP POST request is created using the Net::HTTP::Post.new method with the URI as the argument.
  • The content type of the request is set to ‘application/json’.
  • The body of the request is set to a JSON string that represents a hash. This hash contains three keys: ‘model’, ‘messages’, and ‘stream’. The ‘model’ key is set to ‘ruby’ which is our model, the ‘messages’ key is set to an array containing a single hash representing a user message, and the ‘stream’ key is set to false.
  • The messages hash follows the standard model for interacting with AI models. It takes a role and the content. The roles can be system, user, and assistant. System is the priming message that tells the model how to respond; we already set that in the Modelfile. The user message is our standard prompt, and the model replies with an assistant message.
  • The HTTP request is sent using the Net::HTTP.start method. This method opens a network connection to the specified hostname and port, and then sends the request. The read timeout is set to 120 seconds because, running on a 2019 Intel Mac, the responses can be a little slow. This isn't an issue when running on appropriately sized AWS servers.
  • The response from the server is stored in the ‘response’ variable.
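With `stream: false` the response body is a single JSON object, and per the Ollama API documentation the reply text sits under `message.content`. A minimal parsing sketch (the body below is a hand-written example, not real model output):

```ruby
require 'json'

# Abridged example of an /api/chat response body (stream: false).
body = '{"model":"ruby","message":{"role":"assistant","content":"Use the pdf-reader gem."},"done":true}'

parsed = JSON.parse(body)
reply  = parsed.dig('message', 'content')
puts reply
```

In the real script you would pass `response.body` to `JSON.parse` instead of the hard-coded string.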
Practical Use Cases

The real value of running local AI models comes into play for companies dealing with sensitive data. These models are really good at processing unstructured data, like emails or documents, and extracting valuable, structured information.

For one use case I am training the model on all of the customer information in a CRM. This allows users to ask questions about the customer without needing to go through sometimes hundreds of notes. It’s basically ChatGPT for their customer data.

Conclusion

Where security is not an issue I am more likely to work directly with OpenAI, but for companies that need private models, open source is definitely the way to go.

If I get around to it, one of these days I’ll write a Ruby wrapper around the Ollama APIs to make it a little easier to interact with.

Have fun working with open source models.
