Hugging Face vLLM on HF Jobs: Run a Server in One Command
Spin up a private, OpenAI-compatible vLLM endpoint on Hugging Face Jobs with a single hf jobs run command and pay-per-second billing.
TL;DR
- Spin up a private, OpenAI-compatible vLLM endpoint on Hugging Face Jobs with a single hf jobs run command and pay-per-second billing.
- The post demonstrates a one-line hf jobs run invocation that launches vLLM, exposes port 8000, and prints a job URL and job id you can use to query the server.
- Run one hf jobs run command that pulls the official vllm image, requests a GPU flavor, and exposes vLLM's port.
Hugging Face published a how-to on June 26, 2026 showing you can spin up a private, OpenAI-compatible vLLM endpoint on HF Jobs with a single command, no servers to provision, and pay-per-second billing. The post demonstrates a one-line hf jobs run invocation that launches vLLM, exposes port 8000, and prints a job URL and job id you can use to query the server.
How do you launch a vLLM server on HF Jobs?
Run one hf jobs run command that pulls the official vllm image, requests a GPU flavor, and exposes vLLM's port. The example uses:
hf jobs run --flavor a10g-large --expose 8000 --timeout 2h \
  vllm/vllm-openai:latest \
  vllm serve Qwen/Qwen3-4B --host 0.0.0.0 --port 8000
When the job starts, the command prints three things: a job id (example: 6a381ca1953ed90bfb947332), a job URL such as https://huggingface.co/jobs/qgallouedec/6a381ca1953ed90bfb947332, and an exposed-port address (https://6a381ca1953ed90bfb947332--8000.hf.jobs). Wait for the logs to show "Application startup complete" before sending requests.
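The two addresses above appear to follow a fixed pattern built from the job id. As a minimal sketch, assuming that pattern holds for other jobs and ports (the article only shows this one example), the helpers below rebuild both URLs. The function names are hypothetical and are not part of the hf CLI or huggingface_hub.

```python
def job_page_url(namespace: str, job_id: str) -> str:
    """Job page URL, following the example printed in the article."""
    return f"https://huggingface.co/jobs/{namespace}/{job_id}"


def exposed_url(job_id: str, port: int = 8000) -> str:
    """Proxy address for an exposed port, assuming the <job_id>--<port>.hf.jobs pattern."""
    if not job_id.isalnum():
        raise ValueError(f"unexpected job id: {job_id!r}")
    if not 1 <= port <= 65535:
        raise ValueError(f"port out of range: {port}")
    return f"https://{job_id}--{port}.hf.jobs"


# Values copied from the article's example output.
print(job_page_url("qgallouedec", "6a381ca1953ed90bfb947332"))
print(exposed_url("6a381ca1953ed90bfb947332"))
```

In practice the command prints these addresses itself, so a helper like this is mainly useful in scripts that only keep the job id.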
Prerequisites include a payment method or positive prepaid credit, huggingface_hub >= 1.20.0, and a local hf auth login. Jobs are billed per second, so stop the server with hf jobs cancel followed by the job id.
How do you query the running endpoint?
vLLM speaks the OpenAI API and accepts your HF token as the bearer token. The post gives a curl example that calls the chat completions endpoint on the exposed URL and returns OpenAI‑style JSON. The minimal curl shown is:
curl https:// \
  -H "Authorization: Bearer $(hf auth token)" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen3-4B",
    "messages": [{"role": "user", "content": "Hello!"}],
    "chat_template_kwargs": {"enable_thinking": false}
  }'
In the example response, the assistant message at choices[0].message.content holds the reply "Hello! How can I assist you today? 😊". Requests must carry an HF token with read access to the job; a plain browser visit is rejected because the Jobs proxy gates the API.
The post also shows calling the endpoint via the OpenAI-compatible Python client by setting base_url to https://
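The same request can be sketched in Python with only the standard library. This is an illustration rather than the post's own code: the /v1/chat/completions path is vLLM's usual OpenAI-compatible route and is assumed here, and build_chat_request, extract_reply, and send are hypothetical helper names. The payload mirrors the curl body above.

```python
import json
import urllib.request


def build_chat_request(base_url: str, token: str, prompt: str,
                       model: str = "Qwen/Qwen3-4B") -> urllib.request.Request:
    """Prepare a chat-completions POST; nothing is sent until send() is called."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        # Same switch as the curl example: turn off the model's thinking output.
        "chat_template_kwargs": {"enable_thinking": False},
    }
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/v1/chat/completions",  # assumed route
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",  # HF token with read access to the job
            "Content-Type": "application/json",
        },
        method="POST",
    )


def extract_reply(response_body: dict) -> str:
    """Pull the assistant text out of an OpenAI-style response."""
    return response_body["choices"][0]["message"]["content"]


def send(request: urllib.request.Request, timeout: float = 60.0) -> dict:
    """Send the prepared request and decode the JSON response."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)
```

A call would look like extract_reply(send(build_chat_request(url, token, "Hello!"))), where url is the exposed-port address printed by the job and token is your HF token.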
Can you scale to larger models or add features like SSH and UI?
Yes. The same pattern scales to much larger models by choosing a beefier --flavor and instructing vLLM to shard across GPUs with --tensor-parallel-size. For example, the guide shows the 122B Qwen3.5 mixture-of-experts model on 2x H200 with this command:
hf jobs run --flavor h200x2 --expose 8000 --timeout 2h \
  vllm/vllm-openai:latest \
  vllm serve Qwen/Qwen3.5-122B-A10B \
  --host 0.0.0.0 --port 8000 --tensor-parallel-size 2 \
  --max-model-len 32768 --max-num-seqs 256
The post explains that --tensor-parallel-size should match the number of GPUs in the flavor (h200x2 -> 2) and that Qwen3.5-122B has a 256K-token default context, so the example caps context and concurrent sequences to fit GPU memory. For interactive UIs the author demos a few lines of Gradio that point at the same endpoint and recommends adding --reasoning-parser deepseek_r1 to stream model "thinking" into a separate field.
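The rule that --tensor-parallel-size matches the GPU count can be sketched as a small argument builder. This assumes the GPU count is readable from a trailing xN in the flavor name (as in h200x2) and that a flavor without that suffix has one GPU. That is a naming convention inferred from the article's examples, not a documented guarantee. gpu_count and vllm_serve_args are hypothetical names.

```python
import re
from typing import List, Optional


def gpu_count(flavor: str) -> int:
    """Read the GPU count from a trailing 'xN' suffix; assume 1 GPU when absent."""
    match = re.search(r"x(\d+)$", flavor)
    return int(match.group(1)) if match else 1


def vllm_serve_args(model: str, flavor: str, port: int = 8000,
                    max_model_len: Optional[int] = None,
                    max_num_seqs: Optional[int] = None) -> List[str]:
    """Assemble a 'vllm serve' argument list sized to the chosen flavor."""
    args = ["vllm", "serve", model, "--host", "0.0.0.0", "--port", str(port)]
    gpus = gpu_count(flavor)
    if gpus > 1:
        # Shard the model across every GPU in the flavor.
        args += ["--tensor-parallel-size", str(gpus)]
    if max_model_len is not None:
        # Cap the context length so the KV cache fits in GPU memory.
        args += ["--max-model-len", str(max_model_len)]
    if max_num_seqs is not None:
        # Cap the number of concurrent sequences for the same reason.
        args += ["--max-num-seqs", str(max_num_seqs)]
    return args


print(" ".join(vllm_serve_args("Qwen/Qwen3.5-122B-A10B", "h200x2",
                               max_model_len=32768, max_num_seqs=256)))
```

With the article's values this prints the same vllm serve line as the command above, and for a10g-large it leaves the tensor-parallel flag out.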
SSH into the running container is supported by launching with --ssh and registering your public key at huggingface.co/settings/keys; then connect with hf jobs ssh
Why it matters
HF Jobs gives direct, Docker-run style control on hosted infrastructure: you pick the image, exact vLLM flags, and hardware, and you pay per second. That makes Jobs a fast path for experiments, evals, batch generation, or trying a model before committing. Hugging Face contrasts Jobs with Inference Endpoints, saying Endpoints are the production-ready option with finer-grained access control and scale-to-zero so you are not billed during inactivity.
What to watch
Check hf jobs hardware and the HF Jobs price list for available GPU flavors and hourly costs, and watch for model-specific startup guidance such as --tensor-parallel-size and --max-model-len when you attempt very large models like Qwen3.5-122B. Cancel jobs explicitly to avoid ongoing per-second billing.
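Because prices are listed per hour but billed per second, a quick conversion helps when budgeting a run. The sketch below is plain arithmetic. The hourly rate in it is a made-up placeholder, not a real HF Jobs price, and job_cost_usd is a hypothetical helper.

```python
def job_cost_usd(hourly_rate_usd: float, runtime_seconds: float) -> float:
    """Per-second billing: pay the hourly rate prorated to the seconds used."""
    if hourly_rate_usd < 0 or runtime_seconds < 0:
        raise ValueError("rate and runtime must be non-negative")
    return hourly_rate_usd * runtime_seconds / 3600


# HYPOTHETICAL rate for illustration only; look up the real price list.
HYPOTHETICAL_RATE_USD_PER_HOUR = 5.00

# A 30-minute session at the placeholder rate.
print(f"{job_cost_usd(HYPOTHETICAL_RATE_USD_PER_HOUR, 30 * 60):.2f}")  # prints 2.50
```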
Written by The Brieftide · Source: Hugging Face