Hosting Open Source AI Models with Hugging Face
Open-source AI is an exciting space, with a lot of research and innovation taking place.
These models can be run locally and privately, but sometimes we need more hardware to accelerate the task we're working on.
I've found Hugging Face's Inference Endpoints to be the best solution for deploying models to the web and interacting with them via an HTTP endpoint.
Hugging Face provides:
- Clear pricing ($/hr)
- One-click deployments
- Development SDKs
- Thorough documentation
Make sure you have a Hugging Face account and have added a valid payment method before continuing. You can check the Billing page to confirm you have an active payment method.
Hugging Face provides a Model Catalog to choose from a selection of popular models. We'll be deploying Facebook's seamless-m4t-v2-large model, a slightly more niche model that isn't shown in the Model Catalog.
Click the "Deploy from Hugging Face" button. This opens a modal, allowing us to search through Hugging Face's Model Hub.
At the time of writing, the Model Hub contains 1,598,698 models!
Search for seamless-m4t-v2-large. Click "Import model".
This navigates us to a configuration page where we can select the CPU/GPU, region, security level, autoscaling options, and more.
Our goal with seamless-m4t-v2-large is text-to-text translation. Click the "More options" dropdown and change the Task of the model to Translation.
Configure the endpoint to use the hardware and region needed and click "Create Endpoint". It will take a few minutes for the model to be deployed.
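The same configuration can be expressed in code. As a sketch, here's what those UI choices look like as arguments to the `create_inference_endpoint()` helper in the `huggingface_hub` SDK — the instance and region values below are illustrative placeholders, so copy the exact values shown in the configuration UI for your hardware choice:

```python
# Illustrative mapping of the endpoint configuration UI onto
# huggingface_hub's create_inference_endpoint() keyword arguments.
# Instance/region values are placeholders -- use the ones from the UI.
endpoint_config = {
    "repository": "facebook/seamless-m4t-v2-large",
    "framework": "pytorch",
    "task": "translation",          # the Task set under "More options"
    "accelerator": "gpu",
    "vendor": "aws",                # cloud vendor
    "region": "us-east-1",          # pick a region close to your users
    "type": "protected",            # security level
    "instance_size": "x1",          # placeholder -- check the UI
    "instance_type": "nvidia-a10g", # placeholder -- check the UI
}

def create_endpoint(name: str, token: str):
    """Programmatic equivalent of clicking "Create Endpoint"."""
    # Lazy import so the config above can be inspected without the SDK.
    from huggingface_hub import create_inference_endpoint
    return create_inference_endpoint(name, token=token, **endpoint_config)
```

Whether you click through the UI or call the SDK, the result is the same endpoint; the code form is handy once you're spinning endpoints up and down regularly.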
It's ready to use once the green "Running" tag is showing:
Some models require custom deployment solutions. If you see "Error" instead of "Running", check the Logs tab.
The Playground's API tab has examples you can plug in to quickly interact with your endpoint:
Replace `hf_XXXXX` in the `Bearer` header with your Access Token and enjoy!
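As a sketch of what a call looks like from Python: the endpoint URL below is a placeholder, and the `src_lang`/`tgt_lang` parameter names with three-letter language codes ("eng", "fra") are assumptions based on SeamlessM4T's translation conventions — check the Playground's API tab for the exact shape your endpoint expects.

```python
import json
import urllib.request

# Placeholder -- substitute the URL shown on your endpoint's overview page.
ENDPOINT_URL = "https://XXXXX.endpoints.huggingface.cloud"

def build_payload(text: str, src_lang: str, tgt_lang: str) -> dict:
    """Build the JSON body for a Translation task request.

    The src_lang/tgt_lang names and three-letter codes follow
    SeamlessM4T's conventions and may differ for other models.
    """
    return {
        "inputs": text,
        "parameters": {"src_lang": src_lang, "tgt_lang": tgt_lang},
    }

def translate(text: str, src_lang: str, tgt_lang: str, token: str):
    """POST the payload to the endpoint and return the parsed JSON response."""
    request = urllib.request.Request(
        ENDPOINT_URL,
        data=json.dumps(build_payload(text, src_lang, tgt_lang)).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

For example, `translate("Hello, world!", "eng", "fra", token)` with your Access Token as `token`.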
With the default settings, the endpoint will autoscale to 0 after 15 minutes of inactivity, stopping resource usage and cost.
Clicking "Pause" will immediately scale it to 0 as well.
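If you'd rather manage this from code, the `huggingface_hub` SDK exposes endpoint-management helpers. A minimal sketch, assuming `get_inference_endpoint()` and the `pause()`/`resume()` methods on the object it returns — the endpoint name is whatever you chose at creation time:

```python
def pause_endpoint(name: str, token: str) -> str:
    """Pause the endpoint, scaling it to 0 immediately (billing stops)."""
    # Lazy import so the functions can be defined without the SDK installed.
    from huggingface_hub import get_inference_endpoint
    endpoint = get_inference_endpoint(name, token=token)
    endpoint.pause()
    return endpoint.status

def resume_endpoint(name: str, token: str) -> str:
    """Resume a paused endpoint; it takes a few minutes to come back up."""
    from huggingface_hub import get_inference_endpoint
    endpoint = get_inference_endpoint(name, token=token)
    endpoint.resume()
    return endpoint.status
```

Note the difference from scale-to-zero: an endpoint that scaled to 0 from inactivity wakes up when a request arrives, while a paused endpoint stays down until you resume it.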
Billing shows Inference and Compute Usage.