Deploying a Local LLM Model for Concierge

While the Concierge service works well with large hosted LLM Models, such as GPT, it can be more convenient to use a local LLM Model. Unlike a hosted model, a local one does not require access through the firewall and therefore requires fewer security preparations. Metric Insights allows connecting Concierge to a local LLM Model, such as Google Gemma. This article describes how to establish a connection between Concierge and the Google Gemma LLM Model.

Prerequisites

Do not install Metric Insights Ollama on the same server as the Metric Insights Application and Chatbot. Running a local LLM Model requires a separate, GPU-optimized server with at least 16 GB of video RAM. The following instances have been tested for compatibility with Ollama and are recommended (a quick way to check available video RAM is sketched after the list below):

  • For AWS:
    • g4dn.xlarge
    • g5.xlarge
    • g5.2xlarge
    • p3.2xlarge
  • For Azure:
    • Standard_NC4as_T4_v3
    • NV6ads_A10_v5
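
If you are unsure whether a chosen instance meets the video RAM requirement, you can check it directly on the server. The sketch below is a minimal Python illustration, assuming NVIDIA drivers and the nvidia-smi utility are already installed on the GPU instance; it is not part of the Metric Insights package.

    # Minimal sketch: report each GPU's total video RAM and flag cards that do not
    # meet the 16 GB requirement. Assumes NVIDIA drivers and nvidia-smi are present.
    import subprocess

    MIN_VRAM_MIB = 15000  # a 16 GB card typically reports slightly less than 16384 MiB

    output = subprocess.check_output(
        ["nvidia-smi", "--query-gpu=name,memory.total", "--format=csv,noheader,nounits"],
        text=True,
    )

    for line in output.strip().splitlines():
        name, memory_mib = [field.strip() for field in line.split(",")]
        status = "OK" if int(memory_mib) >= MIN_VRAM_MIB else "below the 16 GB requirement"
        print(f"{name}: {memory_mib} MiB of video RAM ({status})")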

Ollama is installed on the separate server using the Metric Insights Ollama Installation Package, which can be downloaded via the link provided by a Metric Insights support specialist. Use the default approach to install the application on the dedicated server; the default installation approach is described in detail in the Single Server Docker Deployment article.

Data required for the Concierge connection (a quick way to verify both values is sketched after this list):

  1. Address of the server where Ollama is installed
  2. API Key
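
Once both values are available, it can be useful to confirm them before configuring Concierge. The following minimal Python sketch assumes standard Ollama conventions (the /api/tags endpoint on the default port 11434) and passes the API Key as a Bearer token; the server address, port, and header format are assumptions, so substitute the values supplied for your deployment.

    # Minimal connectivity sketch: confirm the Ollama server is reachable and that
    # the expected model has been pulled. Host, port, and the Bearer-token header
    # are assumptions -- replace them with the values for your deployment.
    import json
    import urllib.request

    OLLAMA_URL = "http://ollama.example.internal:11434"  # hypothetical server address
    API_KEY = "<your Ollama API Key>"

    request = urllib.request.Request(
        f"{OLLAMA_URL}/api/tags",  # standard Ollama endpoint that lists pulled models
        headers={"Authorization": f"Bearer {API_KEY}"},
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        models = json.load(response).get("models", [])

    print("Models available on the Ollama server:")
    for model in models:
        print(" -", model.get("name"))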

Concierge Settings

Access Admin > System > Search Setup and open the Concierge tab

  1. Main LLM: Enter the name of the main LLM Model used for Concierge
  2. LLM API Key: Enter the Ollama Secret API Key
  3. LLM API Type: Specify the API type, "openai" or "azure"
  4. LLM API Base URL: Enter the Ollama server address
  5. [Update All Search Indexes]

Depending on the amount of data, indexing can take a significant amount of time, but it is required for Concierge to work correctly.
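
Because the LLM API Type is set to "openai", Concierge communicates with the server through an OpenAI-compatible endpoint. The Python sketch below illustrates one way to verify the configured values end to end; the /v1/chat/completions path and the "gemma" model name follow standard Ollama conventions and are assumptions, not Metric Insights internals, so adjust them to match your own settings.

    # Illustrative sketch only: send one OpenAI-style chat completion request to the
    # configured LLM API Base URL to confirm the model responds. Path, model name,
    # and auth header are assumptions based on standard Ollama behavior.
    import json
    import urllib.request

    BASE_URL = "http://ollama.example.internal:11434"  # LLM API Base URL (hypothetical)
    API_KEY = "<your Ollama API Key>"                   # LLM API Key
    MODEL = "gemma"                                     # Main LLM (assumed model name)

    payload = {
        "model": MODEL,
        "messages": [{"role": "user", "content": "Reply with the single word: ready"}],
    }
    request = urllib.request.Request(
        f"{BASE_URL}/v1/chat/completions",
        data=json.dumps(payload).encode(),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {API_KEY}",
        },
    )
    with urllib.request.urlopen(request, timeout=60) as response:
        reply = json.load(response)["choices"][0]["message"]["content"]

    print("Model responded:", reply.strip())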

For more details on configuring Concierge, check the Install Concierge article.