Hosting Open Source AI Models with Hugging Face
Open-source AI is an exciting space, with a lot of research and innovation taking place.
These models can be run locally and privately, but sometimes we need more hardware to accelerate the task we're working on.
I've found Hugging Face's Inference Endpoints to be the best solution for deploying models to the web and interacting with them via an HTTP endpoint.
Hugging Face provides:
- Clear pricing ($/hr)
- One-click deployments
- Development SDKs
- Thorough documentation
Make sure you have a Hugging Face account and have added a valid payment method before continuing. You can check the Billing page to confirm you have an active payment method.
Hugging Face provides a Model Catalog to choose from a selection of popular models. We'll be deploying Facebook's seamless-m4t-v2-large model, a slightly more niche model that isn't shown in the Model Catalog.
Click the "Deploy from Hugging Face" button. This opens a modal, allowing us to search through Hugging Face's Model Hub.
At the time of writing, the Model Hub contains 1,598,698 models!
Search for seamless-m4t-v2-large. Click "Import model".
This navigates us to a configuration page where we can select the CPU/GPU, region, security level, autoscaling options, and more.
Our goal with seamless-m4t-v2-large is text-to-text translation. Click the "More options" dropdown and change the Task of the model to Translation.
Configure the endpoint to use the hardware and region needed and click "Create Endpoint". It will take a few minutes for the model to be deployed.
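The same configuration can be expressed in code. As a sketch, here's what those UI choices look like as arguments to the `create_inference_endpoint()` helper in the `huggingface_hub` SDK — the instance and region values below are illustrative placeholders, so copy the exact values shown in the configuration UI for your hardware choice:

```python
# Illustrative mapping of the endpoint configuration UI onto
# huggingface_hub's create_inference_endpoint() keyword arguments.
# Instance/region values are placeholders -- use the ones from the UI.
endpoint_config = {
    "repository": "facebook/seamless-m4t-v2-large",
    "framework": "pytorch",
    "task": "translation",          # the Task set under "More options"
    "accelerator": "gpu",
    "vendor": "aws",                # cloud vendor
    "region": "us-east-1",          # pick a region close to your users
    "type": "protected",            # security level
    "instance_size": "x1",          # placeholder -- check the UI
    "instance_type": "nvidia-a10g", # placeholder -- check the UI
}

def create_endpoint(name: str, token: str):
    """Programmatic equivalent of clicking "Create Endpoint"."""
    # Lazy import so the config above can be inspected without the SDK.
    from huggingface_hub import create_inference_endpoint
    return create_inference_endpoint(name, token=token, **endpoint_config)
```

Whether you click through the UI or call the SDK, the result is the same endpoint; the code form is handy once you're spinning endpoints up and down regularly.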
It's ready to use once the green "Running" tag is showing:
Some models require custom deployment solutions. If you see "Error" instead of "Running", check the Logs tab.
The Playground's API tab has examples you can plug in to quickly interact with your endpoint:
Replace `hf_XXXXX` in the `Bearer` header with your Access Token and enjoy!
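As a sketch of what a call looks like from Python: the endpoint URL below is a placeholder, and the `src_lang`/`tgt_lang` parameter names with three-letter language codes ("eng", "fra") are assumptions based on SeamlessM4T's translation conventions — check the Playground's API tab for the exact shape your endpoint expects.

```python
import json
import urllib.request

# Placeholder -- substitute the URL shown on your endpoint's overview page.
ENDPOINT_URL = "https://XXXXX.endpoints.huggingface.cloud"

def build_payload(text: str, src_lang: str, tgt_lang: str) -> dict:
    """Build the JSON body for a Translation task request.

    The src_lang/tgt_lang names and three-letter codes follow
    SeamlessM4T's conventions and may differ for other models.
    """
    return {
        "inputs": text,
        "parameters": {"src_lang": src_lang, "tgt_lang": tgt_lang},
    }

def translate(text: str, src_lang: str, tgt_lang: str, token: str):
    """POST the payload to the endpoint and return the parsed JSON response."""
    request = urllib.request.Request(
        ENDPOINT_URL,
        data=json.dumps(build_payload(text, src_lang, tgt_lang)).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

For example, `translate("Hello, world!", "eng", "fra", token)` with your Access Token as `token`.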
With the default settings, the endpoint will autoscale to 0 after 15 minutes of inactivity, stopping resource usage and cost.
Clicking "Pause" will immediately scale it to 0 as well.
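If you'd rather manage this from code, the `huggingface_hub` SDK exposes endpoint-management helpers. A minimal sketch, assuming `get_inference_endpoint()` and the `pause()`/`resume()` methods on the object it returns — the endpoint name is whatever you chose at creation time:

```python
def pause_endpoint(name: str, token: str) -> str:
    """Pause the endpoint, scaling it to 0 immediately (billing stops)."""
    # Lazy import so the functions can be defined without the SDK installed.
    from huggingface_hub import get_inference_endpoint
    endpoint = get_inference_endpoint(name, token=token)
    endpoint.pause()
    return endpoint.status

def resume_endpoint(name: str, token: str) -> str:
    """Resume a paused endpoint; it takes a few minutes to come back up."""
    from huggingface_hub import get_inference_endpoint
    endpoint = get_inference_endpoint(name, token=token)
    endpoint.resume()
    return endpoint.status
```

Note the difference from scale-to-zero: an endpoint that scaled to 0 from inactivity wakes up when a request arrives, while a paused endpoint stays down until you resume it.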
Billing shows Inference and Compute Usage.