Ollama has two text-generation endpoints: /api/chat and /api/generate.
As stated in this Ollama GitHub issue:
The /api/chat endpoint takes a history of messages and provides the next message in the conversation. This is ideal for conversations with history. The /api/generate API provides a one-time completion based on the input.
So basically /api/chat remembers history, while /api/generate does not.
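For context, this is roughly what calling the two endpoints directly looks like (a minimal sketch against a local Ollama instance using the requests library; the model names are just examples):

import requests

# /api/chat takes the conversation so far as a list of messages
chat_response = requests.post(
    "http://127.0.0.1:11434/api/chat",
    json={
        "model": "deepseek-r1",
        "messages": [{"role": "user", "content": "What is the meaning of life?"}],
        "stream": False,
    },
)
print(chat_response.json()["message"]["content"])

# /api/generate takes a single prompt string
generate_response = requests.post(
    "http://127.0.0.1:11434/api/generate",
    json={
        "model": "gemma3:1b-it-qat",
        "prompt": "What is the meaning of life?",
        "stream": False,
    },
)
print(generate_response.json()["response"])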
LangChain has two interfaces for Ollama: ChatOllama and OllamaLLM.
It is easy to show that the former calls the /api/chat endpoint:
from langchain_ollama import ChatOllama

llm = ChatOllama(model="deepseek-r1", temperature=0.0)
llm.invoke("What is the meaning of life?")
# logs HTTP Request: POST http://127.0.0.1:11434/api/chat "HTTP/1.1 200 OK"
While the latter calls the /api/generate endpoint:
from langchain_ollama import OllamaLLM

llm = OllamaLLM(model="gemma3:1b-it-qat", temperature=0.0)
llm.invoke("What is the meaning of life?")
# POST http://127.0.0.1:11434/api/generate "HTTP/1.1 200 OK"
The LangChain documentation discourages using the latter interface:
However, LangChain also has implementations of older LLMs that do not follow the chat model interface and instead use an interface that takes a string as input and returns a string as output. These models are typically named without the "Chat" prefix (e.g., Ollama, Anthropic, OpenAI, etc.). These models implement the BaseLLM interface and may be named with the "LLM" suffix (e.g., OllamaLLM, AnthropicLLM, OpenAILLM, etc.). Generally, users should not use these models.
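If I read that correctly, the interface difference also shows up in the return types (a small sketch, assuming the langchain_ollama package: the chat model returns a message object, the plain LLM returns a string):

from langchain_ollama import ChatOllama, OllamaLLM

chat_result = ChatOllama(model="deepseek-r1", temperature=0.0).invoke("Hi")
print(type(chat_result), chat_result.content)  # AIMessage; the text lives in .content

llm_result = OllamaLLM(model="gemma3:1b-it-qat", temperature=0.0).invoke("Hi")
print(type(llm_result), llm_result)            # plain str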
Even the RAG tutorial uses a chat model.
However, if I'm asking a question about document B, I don't want the answer to be tainted by the previous answer about document A.
So I feel the "traditional" LLM model (i.e. the /api/generate endpoint) would be better.
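Roughly, the pattern I have in mind is this (a rough sketch; the retrieved contexts are placeholders):

from langchain_ollama import OllamaLLM

llm = OllamaLLM(model="gemma3:1b-it-qat", temperature=0.0)

# Placeholders standing in for text retrieved from each document
context_from_document_a = "..."
context_from_document_b = "..."

def answer(question: str, context: str) -> str:
    # One-off completion: only this question and this document's context are sent,
    # so nothing from a previous answer can leak in.
    return llm.invoke(f"Answer using only this context:\n{context}\n\nQuestion: {question}")

answer_a = answer("What does it conclude?", context_from_document_a)
answer_b = answer("What does it conclude?", context_from_document_b)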
What am I missing?