Say you want to use Llama 3.3 70B Instruct. You could:
- Use it via an API. OpenRouter offers it at ~12 cents / MTok; Azure at ~71 cents / MTok. Your price may vary. (A minimal call sketch follows this list.)
- Self-host it on a major cloud provider. Azure offers A100 80GB at ~$3.67 / hour. In a day, you could generate ~0.5-1M tokens.
- Self-host it on an emerging cloud provider. Lambda Labs offers A100 80GB at ~$1.79 / hour. Again, ~0.5-1M tokens a day.
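Here's roughly what the API route looks like in Python. OpenRouter exposes an OpenAI-compatible endpoint, so the standard `openai` client works; the model slug below is my assumption — check https://openrouter.ai/models for the exact id and current pricing.

```python
# A minimal sketch of the API route via OpenRouter's
# OpenAI-compatible endpoint.
from openai import OpenAI

client = OpenAI(
    base_url="https://openrouter.ai/api/v1",
    api_key="YOUR_OPENROUTER_API_KEY",  # placeholder, not a real key
)

response = client.chat.completions.create(
    model="meta-llama/llama-3.3-70b-instruct",  # assumed slug; verify on OpenRouter
    messages=[{"role": "user", "content": "Summarize this in one line: ..."}],
)
print(response.choices[0].message.content)
```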
Intuitively, self-hosting should be cheaper if you run it continuously. Let's check: say we generate 1 million tokens every day. Then:
- APIs cost 12–71 cents a day
- Azure costs $3.67 × 24 ≈ $88 a day
- Lambda Labs costs $1.79 × 24 ≈ $43 a day
So, the API is 60–700 times cheaper than running it yourself: ~60 times cheaper if you move from a self-hosted Lambda Labs server to the Azure AI Foundry API, ~700 times cheaper if you move from Azure servers to OpenRouter.
Not 60–700% cheaper. 60–700 times cheaper. Instead of spending a million dollars on GPUs, you can get away with $1,500–$15,000 in API fees.
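If you want to verify the arithmetic, here it is as a small Python script. All prices are the figures quoted above; substitute your own.

```python
# Back-of-the-envelope check of the daily costs above.
TOKENS_PER_DAY = 1_000_000  # the workload assumed in this post

# API prices, $ per million tokens
api_per_mtok = {"OpenRouter": 0.12, "Azure AI Foundry": 0.71}

# Self-hosted GPU prices, $ per hour (one A100 80GB, running 24/7)
gpu_per_hour = {"Azure": 3.67, "Lambda Labs": 1.79}

api_daily = {name: p * TOKENS_PER_DAY / 1e6 for name, p in api_per_mtok.items()}
gpu_daily = {name: p * 24 for name, p in gpu_per_hour.items()}

for gpu, g_cost in gpu_daily.items():
    for api, a_cost in api_daily.items():
        print(f"{gpu} self-host ${g_cost:.2f}/day vs {api} ${a_cost:.2f}/day "
              f"-> API is {g_cost / a_cost:.0f}x cheaper")
```

The extremes of the printed ratios (~60x and ~730x) are where the 60–700 range comes from.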
But what if…
- What if you need higher scale? APIs typically offer better scalability than most organizations can build themselves.
- What if you need higher uptime? Again, APIs typically offer higher uptime than most organizations can achieve on their own.
- What if you need lower latency? Check the throughput figures. APIs typically serve tokens much faster than a self-hosted deployment.
- What if you need edge computing? Then your GPU is a one-time cost, and my comparison is irrelevant.
So, are there any reasons to self-host? I’ve seen only a few.
- Fine-tuned models. No one else offers an API version of your model.
- Ultra-high security. You can’t trust even Microsoft, Google, or Amazon with your data. Your chats must remain in your own data center. (In this case, you probably run your own email and cloud services instead of Office 365 or Google Workspace.)
- Learning. You’re curious about what it takes to self-host these models or want to build this skill.
If you don’t have one of the above needs, remember: your GPU costs can shrink from ~$1M to $1.5K–$15K.
Source: OpenAI Deep Research.