
Hugging Face vLLM on HF Jobs: Run a Server in One Command

Spin up a private, OpenAI-compatible vLLM endpoint on Hugging Face Jobs with a single hf jobs run command and pay-per-second billing.

The Brieftide

TL;DR

  • Spin up a private, OpenAI-compatible vLLM endpoint on Hugging Face Jobs with a single hf jobs run command and pay-per-second billing.
  • The post demonstrates a one-line hf jobs run invocation that launches vLLM, exposes port 8000, and prints a job URL and job id you can use to query the server.
  • Run one hf jobs run command that pulls the official vllm image, requests a GPU flavor, and exposes vLLM's port.

Hugging Face published a how-to on June 26, 2026 showing you can spin up a private, OpenAI-compatible vLLM endpoint on HF Jobs with a single command, no servers to provision, and pay-per-second billing. The post demonstrates a one-line hf jobs run invocation that launches vLLM, exposes port 8000, and prints a job URL and job id you can use to query the server.

How do you launch a vLLM server on HF Jobs?

Run one hf jobs run command that pulls the official vllm image, requests a GPU flavor, and exposes vLLM's port. The example uses:

hf jobs run --flavor a10g-large --expose 8000 --timeout 2h \
  vllm/vllm-openai:latest \
  vllm serve Qwen/Qwen3-4B --host 0.0.0.0 --port 8000

When the job starts, the command prints three things: a job id (example: 6a381ca1953ed90bfb947332), a job page URL such as https://huggingface.co/jobs/qgallouedec/6a381ca1953ed90bfb947332, and an exposed-port address (https://6a381ca1953ed90bfb947332--8000.hf.jobs). Wait for the logs to show "Application startup complete" before sending requests.
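Both addresses in the example output appear to be derived from the job id. A minimal Python sketch of that pattern, assuming it holds for other jobs and ports; the helper names are hypothetical and not part of huggingface_hub:

```python
def job_page_url(namespace: str, job_id: str) -> str:
    """Job page URL, following the pattern seen in the example output."""
    return f"https://huggingface.co/jobs/{namespace}/{job_id}"


def exposed_url(job_id: str, port: int = 8000) -> str:
    """Exposed-port address, assuming the `<job id>--<port>.hf.jobs` pattern."""
    return f"https://{job_id}--{port}.hf.jobs"


job_id = "6a381ca1953ed90bfb947332"  # job id from the example output
print(job_page_url("qgallouedec", job_id))
print(exposed_url(job_id))
```

In practice, copying the addresses the command prints is safer than reconstructing them, since the pattern is inferred from a single example.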

Prerequisites include a payment method or positive prepaid credit, huggingface_hub >= 1.20.0, and a local login via hf auth login. Jobs are billed per second, so stop the server with hf jobs cancel when you are finished. The post notes that an a10g-large runs at $1.50/hour, and it recommends checking hf jobs hardware for the full price list and picking the smallest flavor that fits your model.
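To get a feel for what per-second billing means at the quoted rate, here is a small Python sketch. It assumes cost scales linearly with run time, with no minimum charge or rounding, which the post does not spell out; the function name is hypothetical.

```python
def job_cost_usd(seconds: float, hourly_rate_usd: float) -> float:
    """Cost of a run billed per second at the given hourly rate (linear, no minimum assumed)."""
    if seconds < 0:
        raise ValueError("seconds must be non-negative")
    return seconds * hourly_rate_usd / 3600


A10G_LARGE_RATE = 1.50  # USD per hour, as quoted in the post

# A 20-minute experiment
print(f"${job_cost_usd(20 * 60, A10G_LARGE_RATE):.2f}")   # $0.50
# A job left running until the 2h timeout
print(f"${job_cost_usd(2 * 3600, A10G_LARGE_RATE):.2f}")  # $3.00
```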

How do you query the running endpoint?

vLLM speaks the OpenAI API and accepts your HF token as the bearer token. The post gives a curl example that calls the chat completions endpoint on the exposed URL and returns OpenAI‑style JSON. The minimal curl shown is:

curl https://<job_id>--8000.hf.jobs/v1/chat/completions \
  -H "Authorization: Bearer $(hf auth token)" \
  -H "Content-Type: application/json" \
  -d '{ "model": "Qwen/Qwen3-4B", "messages": [{"role": "user", "content": "Hello!"}], "chat_template_kwargs": {"enable_thinking": false} }'

In the example response, the assistant message is at choices[0].message.content and contains the reply "Hello! How can I assist you today? 😊". Requests must include an HF token with read access to the job; a plain browser visit will be rejected because the jobs proxy gates the API.
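The same request can be sketched in Python with only the standard library. This is an illustration of the request shape described above, not code from the post: the helper names and the trimmed sample response are hypothetical, and the request is built but not sent.

```python
import json
import urllib.request


def build_chat_request(base_url: str, token: str, model: str, prompt: str) -> urllib.request.Request:
    """Build (but do not send) a POST to the chat completions endpoint."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        # Same field as in the curl example: asks the chat template to skip thinking.
        "chat_template_kwargs": {"enable_thinking": False},
    }
    return urllib.request.Request(
        url=f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",  # HF token with read access to the job
            "Content-Type": "application/json",
        },
        method="POST",
    )


def extract_reply(response_body: str) -> str:
    """Return the assistant text from an OpenAI-style chat completion body."""
    return json.loads(response_body)["choices"][0]["message"]["content"]


# Trimmed-down body in the OpenAI shape; a real response carries more fields.
sample = '{"choices": [{"message": {"role": "assistant", "content": "Hello! How can I assist you today?"}}]}'
print(extract_reply(sample))
```

To actually send the request, pass the result of build_chat_request to urllib.request.urlopen and feed the decoded body to extract_reply.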

The post also shows how to call the endpoint with the OpenAI-compatible Python client: set base_url to https://<job_id>--8000.hf.jobs/v1 and api_key to huggingface_hub.get_token().
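A rough sketch of that client setup, assuming the openai and huggingface_hub packages are installed. The function names are hypothetical, the URL pattern is inferred from the example output, and passing chat_template_kwargs through extra_body mirrors the curl payload rather than quoting the post's own code.

```python
def endpoint_base_url(job_id: str, port: int = 8000) -> str:
    """OpenAI-style base URL for a job's exposed port (pattern assumed from the example)."""
    return f"https://{job_id}--{port}.hf.jobs/v1"


def ask(job_id: str, prompt: str, model: str = "Qwen/Qwen3-4B") -> str:
    """Send one chat message to the running vLLM job and return the reply text."""
    # Imported lazily so endpoint_base_url works without these packages installed.
    from huggingface_hub import get_token
    from openai import OpenAI

    client = OpenAI(base_url=endpoint_base_url(job_id), api_key=get_token())
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        # Fields outside the OpenAI schema are sent via extra_body.
        extra_body={"chat_template_kwargs": {"enable_thinking": False}},
    )
    return response.choices[0].message.content
```

Calling ask with the job id printed at launch and a prompt such as "Hello!" should return the assistant's reply once the server has finished starting up.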

Can you scale to larger models or add features like SSH and UI?

Yes. The same pattern scales to much larger models by choosing a beefier --flavor and instructing vLLM to shard across GPUs with --tensor-parallel-size. For example, the guide shows the 122B Qwen3.5 mixture-of-experts model on 2x H200 with this command:

hf jobs run --flavor h200x2 --expose 8000 --timeout 2h \
  vllm/vllm-openai:latest \
  vllm serve Qwen/Qwen3.5-122B-A10B \
  --host 0.0.0.0 --port 8000 --tensor-parallel-size 2 \
  --max-model-len 32768 --max-num-seqs 256

The post explains that --tensor-parallel-size should match the number of GPUs in the flavor (h200x2 -> 2) and that Qwen3.5-122B has a 256K-token default context, so the example caps context and concurrent sequences to fit GPU memory. For interactive UIs the author demos a few lines of Gradio that point at the same endpoint and recommends adding --reasoning-parser deepseek_r1 to stream model "thinking" into a separate field.
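One way to keep --tensor-parallel-size in step with the flavor is to derive it from the flavor name. The Python sketch below assumes multi-GPU flavors end in x followed by the GPU count (as h200x2 does) and treats any other name as a single GPU; that naming rule is an assumption to verify against hf jobs hardware, and both helper names are hypothetical.

```python
import re
import shlex


def gpus_in_flavor(flavor: str) -> int:
    """GPU count from a trailing `x<N>` in the flavor name, defaulting to 1 (assumed convention)."""
    match = re.search(r"x(\d+)$", flavor)
    return int(match.group(1)) if match else 1


def vllm_job_command(flavor: str, model: str, max_model_len: int, max_num_seqs: int) -> list[str]:
    """Assemble the `hf jobs run` argv with tensor parallelism tied to the flavor."""
    return [
        "hf", "jobs", "run",
        "--flavor", flavor, "--expose", "8000", "--timeout", "2h",
        "vllm/vllm-openai:latest",
        "vllm", "serve", model,
        "--host", "0.0.0.0", "--port", "8000",
        "--tensor-parallel-size", str(gpus_in_flavor(flavor)),
        "--max-model-len", str(max_model_len),
        "--max-num-seqs", str(max_num_seqs),
    ]


# Rebuilds the command from the guide's 122B example and prints it as one shell line.
cmd = vllm_job_command("h200x2", "Qwen/Qwen3.5-122B-A10B", 32768, 256)
print(shlex.join(cmd))
```

Building the command as a list keeps the flags from the example in one place, so switching flavors only changes one argument.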

You can SSH into the running container by launching the job with --ssh and registering your public key at huggingface.co/settings/keys, then connecting with hf jobs ssh <job_id>. The post notes SSH support requires huggingface_hub >= 1.20.0.

Why it matters

HF Jobs gives direct, Docker-run style control on hosted infrastructure: you pick the image, exact vLLM flags, and hardware, and you pay per second. That makes Jobs a fast path for experiments, evals, batch generation, or trying a model before committing. Hugging Face contrasts Jobs with Inference Endpoints, saying Endpoints are the production-ready option with finer-grained access control and scale-to-zero so you are not billed during inactivity.

What to watch

Check hf jobs hardware and the HF Jobs price list for available GPU flavors and hourly costs, and watch for model-specific startup guidance on flags such as --tensor-parallel-size and --max-model-len when you attempt very large models like Qwen3.5-122B. Cancel jobs explicitly to avoid ongoing per-second billing.


Written by The Brieftide · Source: Hugging Face
