Ollama has two endpoints: /api/chat and /api/generate.

As stated in this Ollama GitHub issue:

The /api/chat endpoint takes a history of messages and provides the next message in the conversation. This is ideal for conversations with history.

The /api/generate API provides a one-time completion based on the input.

So basically /api/chat remembers history, while /api/generate does not.
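To see the difference concretely, here is a minimal sketch that calls both endpoints directly with Python's requests library (assuming Ollama is running on the default port and the deepseek-r1 model is pulled; the prompt is just illustrative):

import requests

BASE = "http://127.0.0.1:11434"

# /api/chat takes the conversation so far as a list of messages
chat = requests.post(f"{BASE}/api/chat", json={
    "model": "deepseek-r1",
    "messages": [{"role": "user", "content": "What is the meaning of life?"}],
    "stream": False,
})
print(chat.json()["message"]["content"])

# /api/generate takes a single prompt string and returns one completion
gen = requests.post(f"{BASE}/api/generate", json={
    "model": "deepseek-r1",
    "prompt": "What is the meaning of life?",
    "stream": False,
})
print(gen.json()["response"])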

LangChain has two interfaces for Ollama: ChatOllama and OllamaLLM.

It is easy to show that the former calls the /api/chat endpoint:

from langchain_ollama import ChatOllama

llm = ChatOllama(model="deepseek-r1", temperature=0.0)
llm.invoke("What is the meaning of life?")
# logs HTTP Request: POST http://127.0.0.1:11434/api/chat "HTTP/1.1 200 OK"

While the latter calls the /api/generate endpoint:

from langchain_ollama import OllamaLLM

llm = OllamaLLM(model="gemma3:1b-it-qat", temperature=0.0)
llm.invoke("What is the meaning of life?")
# POST http://127.0.0.1:11434/api/generate "HTTP/1.1 200 OK"

The LangChain documentation discourages using the latter:

However, LangChain also has implementations of older LLMs that do not follow the chat model interface and instead use an interface that takes a string as input and returns a string as output. These models are typically named without the "Chat" prefix (e.g., Ollama, Anthropic, OpenAI, etc.). These models implement the BaseLLM interface and may be named with the "LLM" suffix (e.g., OllamaLLM, AnthropicLLM, OpenAILLM, etc.). Generally, users should not use these models.

Even the RAG tutorial uses a chat model.

However, if I'm asking a question about document B, I don't want the answer to be tainted by the previous answer about document A.

So I feel the "traditional" LLM model (i.e. the /api/generate endpoint) would be better.

What am I missing?

robertspierre

1 Answer


/generate should be used for base models, while /chat should be used for instruction-tuned models.

Base models are trained as next token predictors. They take some blob of text and extend it through next-token prediction. This is what the /generate endpoint is designed to facilitate.
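As a sketch of that "pure continuation" behavior, /api/generate can be called with raw set to true so that no chat prompt template is applied (assuming a base, non-instruct model is pulled locally; the model tag below is a placeholder):

import requests

resp = requests.post("http://127.0.0.1:11434/api/generate", json={
    "model": "my-base-model",   # placeholder: substitute a base (non-instruct) model you have pulled
    "prompt": "The meaning of life is",
    "raw": True,                # skip the prompt template: the model just extends the text
    "stream": False,
})
print(resp.json()["response"])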

Instruction-tuned models are trained with SFT/RL to behave like an "assistant" engaged in a dialogue with a "user". You can read more about this here.

From an inference standpoint, base models will simply keep generating text as long as you allow them to, while instruction-tuned models expect to receive context as a list of messages alternating between a user and an assistant. The instruction-tuned model writes the next assistant message, then stops to wait for the next user input.
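To make that message format concrete, here is a small sketch of passing an alternating user/assistant history to a chat model in LangChain (using the standard langchain_core message classes; the conversation content is made up):

from langchain_core.messages import AIMessage, HumanMessage
from langchain_ollama import ChatOllama

llm = ChatOllama(model="deepseek-r1", temperature=0.0)

history = [
    HumanMessage(content="Summarize document A for me."),
    AIMessage(content="Document A argues that ..."),   # the previous assistant turn
    HumanMessage(content="What were its main conclusions?"),
]
reply = llm.invoke(history)   # the model writes the next assistant message, then stops
print(reply.content)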

The /chat endpoint is designed to structure/impose the user/assistant dialogue format. Most modern models are instruction-tuned and specifically expect the chat format (exceptions would be if you are intentionally using a base model). As a simple rule, I would generally use /chat unless you have a specific reason not to.

One point to note: when you use the /chat endpoint, you are the one providing the message history in each API request. To your point of "if I'm asking a question about document B, I don't want the answer to be tainted by the previous answer about document A" - you can simply start a new conversation or leave out the messages about document A. You control which messages are sent to the model.
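For example (a sketch reusing the ChatOllama setup from the question; the document contents are placeholders), independent questions about independent documents can each be sent with their own fresh message list:

from langchain_core.messages import HumanMessage
from langchain_ollama import ChatOllama

llm = ChatOllama(model="deepseek-r1", temperature=0.0)

doc_a = "...text of document A..."   # placeholder
doc_b = "...text of document B..."   # placeholder

# Each call gets its own self-contained message list, so nothing about
# document A leaks into the answer about document B.
answer_a = llm.invoke([HumanMessage(content=f"Context:\n{doc_a}\n\nWhat is this document about?")])
answer_b = llm.invoke([HumanMessage(content=f"Context:\n{doc_b}\n\nWhat is this document about?")])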

Karl