Baseten
Inference platform · Self-serve · US
Baseten runs production inference for teams that want a model behind an API without operating GPU infrastructure themselves. The platform sits a layer above raw compute, handling cold starts, auto-scaling, and observability so the engineering work focuses on the model rather than the cluster.
On-demand pricing is now public. The H100 80GB rate sits at $6.50/hr with per-minute billing, and the broader catalog spans B200, H100, and A100. The headline rate runs higher than raw-compute providers because the platform layer is built into the price, which is the trade Baseten asks production teams to make.
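Billing granularity matters most for short jobs. The sketch below compares what a single short run would cost at the quoted H100 rate under three billing increments. It assumes usage is rounded up to the next full increment, which the text above does not confirm for Baseten; the 90-second runtime and the function name `billed_cost` are illustrative.

```python
import math

def billed_cost(runtime_s: int, rate_per_hour: float, increment_s: int) -> float:
    """Cost of one run when usage is rounded UP to the billing increment (an assumption)."""
    billed_s = math.ceil(runtime_s / increment_s) * increment_s
    return billed_s / 3600 * rate_per_hour

H100_RATE = 6.50  # USD per GPU-hour, the rate quoted above

# A hypothetical 90-second inference burst under three increments.
per_second = billed_cost(90, H100_RATE, 1)     # billed for 90 s
per_minute = billed_cost(90, H100_RATE, 60)    # billed for 120 s
per_hour = billed_cost(90, H100_RATE, 3600)    # billed for a full hour

for label, cost in [("per-second", per_second),
                    ("per-minute", per_minute),
                    ("per-hour", per_hour)]:
    print(f"{label:>10}: ${cost:.4f}")
```

For long-running deployments the three figures converge, so the increment mainly affects bursty or scale-from-zero traffic.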
Common workloads include large language models, transcription, and image generation served at production latency, often with traffic shapes that need scale-from-zero and burst handling. Teams that have outgrown a hobby-tier inference provider but don't want to run their own Kubernetes typically land here.
The Model APIs product, separate from dedicated deployments, lets teams call popular open-weights models like DeepSeek, Kimi, and GLM at per-token rates. The two surfaces share the same infrastructure.
Visit baseten.co→
CoreWeave
Hyperscale neocloud · Self-serve · US, EU
CoreWeave runs one of the larger NVIDIA fleets in the market, with capacity spanning GB200 NVL72, B200, H200, H100, A100, and L40S. The hardware breadth is the headline story for buyers, since it covers everything from a single L40S inference instance up to a multi-thousand-GPU training cluster.
On-demand pricing exists and is self-serve. An 8-GPU HGX H100 node prices at $49.24/hr, which works out to $6.16 per GPU per hour, with billing measured in instance-hours rather than seconds or minutes. Reserved capacity discounts run up to 60% for committed terms, and the company sells contracts measured in months or years as its primary revenue path.
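The per-GPU figure can be checked with a few lines of arithmetic. This sketch reproduces the division and also shows the lowest rate the "up to 60%" reserved discount would imply; that floor is an upper-bound illustration, not a published price, and the helper name `cents` is hypothetical.

```python
from decimal import Decimal, ROUND_HALF_UP

NODE_RATE = Decimal("49.24")             # 8-GPU HGX H100 node, USD/hr (from the text)
GPUS_PER_NODE = 8
MAX_RESERVED_DISCOUNT = Decimal("0.60")  # "up to 60%", treated here as the best case

def cents(amount: Decimal) -> Decimal:
    """Round to whole cents, half up, the way a rate card usually displays prices."""
    return amount.quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)

per_gpu = NODE_RATE / GPUS_PER_NODE                     # exact: 6.155
reserved_floor = per_gpu * (1 - MAX_RESERVED_DISCOUNT)  # best-case reserved rate

print(f"on-demand per GPU: ${cents(per_gpu)}/hr")
print(f"best-case reserved per GPU: ${cents(reserved_floor)}/hr")
```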
The fit covers sustained training and large-scale inference workloads where high-bandwidth networking and capacity guarantees matter more than per-hour billing granularity. Teams running multi-node training jobs across dozens or hundreds of GPUs tend to land here once the workload outgrows a single-cluster provider.
Procurement on the contract side runs in weeks. Self-serve on-demand spin-up exists for buyers who want to try the platform before committing.
Visit coreweave.com→
Fal
Inference platform · Self-serve · Global
Fal specializes in inference for image, video, and audio models, with a catalog of more than 1,000 production-ready open-weights models exposed behind a unified API. The platform handles cold starts, queueing, and model warming so a single endpoint can serve burst consumer-facing traffic without a self-managed cluster.
The company runs two pricing surfaces. The Model APIs charge per output unit (per image, per video-second, per audio-second), which suits teams that don't want to think about GPU hours at all. Dedicated compute rents H100 capacity at $1.89/hr with per-second billing, which puts Fal near the floor of the table for raw GPU pricing.
The low Compute rate likely reflects economics from the Model APIs side, where the company captures margin on the platform layer. Customers who use the Compute product typically have a model that doesn't fit the standard catalog and want raw access to a GPU at a competitive rate.
Teams shipping consumer-facing media products land here because the Model APIs include current-generation models like Flux, Veo, Kling, and Seedance behind a single integration. Production traffic at scale routes through dedicated endpoints with reserved capacity.
Visit fal.ai→
Lambda
Training-focused · Self-serve · US
Lambda sells GPU compute aimed primarily at training and fine-tuning workloads. Current capacity spans B200, H100, A100, and older Tesla V100s across US regions.
H100 SXM prices at $3.99/hr per GPU on an 8-GPU instance, with single-GPU rentals higher at $4.29/hr. The 1-Click Clusters product extends the same hardware into 16-to-2,000-GPU configurations, with B200 capacity starting at $9.86/hr per GPU on 16-GPU clusters and falling to $8.87/hr at 256+ GPUs. Billing runs per minute for on-demand instances and weekly for 1-Click Clusters.
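Weekly billing makes the smallest cluster purchase a sizeable one. The sketch below estimates one week of B200 cluster cost at the two quoted price points, assuming a billed week means 168 hours for every GPU in the cluster; the function name `weekly_cluster_cost` is illustrative.

```python
from decimal import Decimal

HOURS_PER_WEEK = 24 * 7  # assumption: one billed week = 168 hours per GPU

def weekly_cluster_cost(gpus: int, rate_per_gpu_hour: str) -> Decimal:
    """Total USD for one week of a cluster at a flat per-GPU hourly rate."""
    return Decimal(rate_per_gpu_hour) * gpus * HOURS_PER_WEEK

small = weekly_cluster_cost(16, "9.86")    # entry-level 16-GPU B200 cluster
large = weekly_cluster_cost(256, "8.87")   # 256-GPU cluster at the lower rate

print(f"16 GPUs:  ${small:,.2f} per week")
print(f"256 GPUs: ${large:,.2f} per week")
```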
The platform suits training and fine-tuning work where the team wants a familiar Linux box with the standard CUDA stack and the option to scale into multi-node clusters. The "self-serve, first-come access" model means capacity isn't always available at the size you want, which is one trade for the relatively clean pricing structure.
Lambda is no longer the cheapest H100 in the table; several providers now publish rates well below its $3.99/hr. The current positioning is closer to "stable training-grade infrastructure with predictable pricing" than "lowest published rate."
Visit lambdalabs.com→
Modal
Serverless compute · Self-serve · US
Modal runs serverless GPU functions. A Python decorator turns a function into a GPU-backed endpoint with fast cold starts and per-second billing only while the GPU runs, which means idle time costs nothing and bursty inference scales cleanly from zero.
The catalog covers B200, H200, H100, A100, and L40S at competitive per-hour rates. H100 prices at $3.95/hr, undercutting Baseten and Replicate while staying above the marketplace floor at Vast and the Community Cloud tier at RunPod. Region selection and non-preemptible execution carry surcharges of 1.5–1.75x and 3x respectively, so the headline rate represents preemptible workloads in the default region.
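To see how far the effective rate can move from the headline, the sketch below applies each quoted multiplier to the $3.95/hr H100 rate on its own. How the multipliers combine when both apply is not stated above, so the sketch deliberately does not stack them.

```python
from decimal import Decimal

BASE_H100 = Decimal("3.95")  # USD/hr, preemptible in the default region (from the text)

# Multipliers quoted above; each is applied separately, not combined.
MULTIPLIERS = {
    "preemptible, default region": Decimal("1"),
    "region selection (low end)": Decimal("1.5"),
    "region selection (high end)": Decimal("1.75"),
    "non-preemptible": Decimal("3"),
}

effective = {label: BASE_H100 * m for label, m in MULTIPLIERS.items()}

for label, rate in effective.items():
    print(f"{label:<30} ${rate:.2f}/hr")
```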
The premium over the cheapest providers buys scale-to-zero, fast cold starts, and a developer experience designed for iteration rather than infrastructure work. Teams building inference behind variable traffic, batch jobs, or experimentation workflows tend to settle on Modal once the headline rate stops mattering.
AWS and GCP marketplace integrations let large customers spend committed cloud budget on Modal, which materially affects total cost for teams with existing hyperscaler commits.
Visit modal.com→
Nebius
European neocloud · Self-serve · EU, US
Nebius operates large H200, H100, B200, and L40S capacity across EU and US regions, with the European footprint and compliance posture as the primary draw for buyers based in or selling into the EU.
On-demand H100 prices at $2.95/hr with per-second billing, putting Nebius among the cheaper hyperscale-grade options that still own their hardware. Reserved capacity discounts run up to 35% for committed terms, and the unified billing model introduced in late 2025 bundles GPU, vCPU, and RAM into a single per-GPU-hour rate rather than charging components separately.
The platform suits European AI companies handling regulated data or building products for the EU market, and the residency story is genuine rather than marketing language. US workloads route competitively too, particularly for teams that want a self-serve alternative to CoreWeave at training scale.
The fleet includes GB200 and B300 NVLink systems for frontier workloads, though those tier into "Contact us" pricing rather than published rates.
Visit nebius.com→
Nscale
AI infrastructure · Sales-led · EU, UK, US
Nscale operates bare-metal NVIDIA GPU infrastructure for large-scale training and inference workloads. Data centers span Norway, the UK, the US, Portugal, and Iceland, with additional capacity coming online in West Virginia.
The product line spans bare-metal GPU compute, managed Slurm and Kubernetes, and AI services like inference endpoints and fine-tuning. Published on-demand pricing for raw compute remains limited, and engagements run through sales sized to workload, term, and capacity reservation across data centers.
The AI Services product (inference, fine-tuning, prompt workbench) is self-serve with Stripe-based credit purchases, but the compute infrastructure that anchors the company's revenue continues to be sales-led. Buyers who only need inference endpoints can land on the platform without a sales conversation.
Engagements that fit Nscale typically involve a procurement process measured in weeks or months, capacity reservations spanning multiple regions, and a contract that specifies performance and operational guarantees.
Visit nscale.com→
Replicate
Model-hosting platform · Self-serve · Global
Replicate runs a model-hosting platform with a large catalog of open-weights models exposed behind a per-second-billed API. The product surface targets developers who want to call a model the way they call any other API, with cold-start handling, model versioning, and a web playground reducing the friction between idea and working prototype.
H100 capacity prices at $5.49/hr with per-second billing for dedicated hardware, alongside A100 at $5.04/hr and L40S at $3.51/hr. The catalog also includes per-output pricing for popular public models like Flux, Claude, and DeepSeek, which is the dominant usage pattern.
Builders prototyping a product around an open model, or running batch inference jobs at modest volume, tend to find Replicate fastest to value. The deployment experience for custom models, built on Cog (Replicate's open-source packaging tool), means a team can move from "model trained on local machine" to "model running behind a production API" in an afternoon.
Teams that scale into steady high-volume traffic usually evaluate Modal or Fal as the per-request economics shift in their favor.
Visit replicate.com→
RunPod
On-demand and spot · Self-serve · Global
RunPod sells on-demand and spot GPU instances at per-second billing across a fleet that spans B200, H200, H100, A100, L40S, and consumer-class GPUs like the 4090 and 5090. The platform covers more than 30 regions globally, which is the broadest geographic footprint of any provider in the table.
The H100 PCIe rate on Community Cloud prices at $1.99/hr, with H100 SXM at $2.69/hr and the higher-reliability Secure Cloud tier priced above both. Community Cloud uses capacity from independent hosts at lower rates and weaker SLAs, while Secure Cloud runs on RunPod-owned hardware with stronger uptime guarantees.
Three product modes serve different workloads. Pods are persistent GPU rentals priced by the hour at per-second granularity. Serverless adds autoscaling with per-second billing on cold-started workers, with the H100 Serverless rate at $4.18/hr. Clusters launch multi-GPU configurations from 8 to 64 GPUs for training jobs.
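Choosing between a Pod and Serverless mostly comes down to utilization. The sketch below estimates the break-even point using the H100 SXM and H100 Serverless rates quoted above. It assumes the Pod stays up 24 hours a day, that the two rates refer to comparable hardware, and that cold-start time and storage are negligible; none of those are confirmed by the text.

```python
POD_RATE = 2.69         # USD/hr, H100 SXM rate quoted above (billed while the Pod exists)
SERVERLESS_RATE = 4.18  # USD/hr, H100 Serverless rate quoted above (billed while workers run)

def daily_cost_pod(hours_up: float = 24.0) -> float:
    """A persistent Pod is billed for every hour it exists, busy or idle."""
    return POD_RATE * hours_up

def daily_cost_serverless(busy_hours: float) -> float:
    """Serverless is billed only for the hours workers are actually running."""
    return SERVERLESS_RATE * busy_hours

# Fraction of the day the GPU must be busy before the Pod becomes cheaper.
break_even_utilization = POD_RATE / SERVERLESS_RATE
break_even_hours = break_even_utilization * 24

print(f"break-even: {break_even_utilization:.1%} busy, "
      f"about {break_even_hours:.1f} hours per day")
```

Under these assumptions, a workload that is busy for only a few hours a day costs less on Serverless, while a GPU that is busy most of the day costs less as a Pod.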
Iteration, fine-tuning, and cost-sensitive inference workloads benefit from the fast spin-up and per-second billing. Production teams that need stricter SLAs route through Secure Cloud or move to providers with more rigid reliability guarantees.
Visit runpod.io→
Together AI
Inference and training · Self-serve · US
Together AI runs three distinct products that share the same underlying GPU infrastructure. The Serverless Inference catalog exposes a wide range of open-weights models (Llama, DeepSeek, Qwen, Kimi, GLM) at per-token pricing, which is the company's most-used product. Dedicated Inference rents single-tenant GPU instances by the hour for teams that need guaranteed performance ($6.49/hr H100). GPU Clusters sell multi-node configurations for training and fine-tuning at $5.49/hr per GPU on-demand, with reserved pricing as low as $3.99/hr at 91+ day commitments.
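The reserved cluster rate only pays off if the GPUs would otherwise be rented on demand for most of the term. The sketch below works out the implied discount and the minimum on-demand usage at which a 91-day commitment breaks even, assuming the commitment is billed for every hour of the term at the lowest quoted rate. Both the billing assumption and the variable names are illustrative.

```python
ON_DEMAND = 5.49   # USD per GPU-hour, GPU Clusters on-demand (from the text)
RESERVED = 3.99    # USD per GPU-hour, lowest quoted reserved rate
TERM_DAYS = 91     # shortest term at which the lowest rate is quoted

term_hours = TERM_DAYS * 24
implied_discount = 1 - RESERVED / ON_DEMAND

# Assumption: the commitment bills every hour of the term, used or not.
commit_cost_per_gpu = RESERVED * term_hours

# On-demand hours that would cost the same as the full commitment.
break_even_hours = commit_cost_per_gpu / ON_DEMAND
break_even_fraction = break_even_hours / term_hours

print(f"implied discount: {implied_discount:.1%}")
print(f"commitment per GPU: ${commit_cost_per_gpu:,.2f}")
print(f"break-even usage: {break_even_fraction:.1%} of the term")
```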
Billing on the dedicated hardware side runs per-minute, with the company's tooling targeting teams who train, fine-tune, and serve their own models on the same platform.
A Batch Inference API runs at 50% discount for workloads that don't need real-time responses, which materially shifts the economics for large-scale data labeling, evaluation, or offline generation jobs.
Customers building on open-weights models tend to start with the Serverless inference catalog and move to dedicated capacity as volume grows. The path from notebook to multi-node cluster on the same provider is a meaningful operational advantage that hyperscalers and pure-inference platforms struggle to match.
Visit together.ai→
Vast.ai
GPU marketplace · Self-serve · Global
Vast.ai operates a marketplace where independent GPU owners list capacity for rent, which makes it structurally different from every other provider in the table. The company doesn't own the hardware. It runs the platform that matches buyers to sellers, charges by the second, and lets the market set the price.
H100 SXM pricing starts around $1.87/hr at the marketplace floor, with median pricing closer to $2.12/hr and the upper quartile at $3.40/hr depending on host reliability score and region. The fleet covers more than 60 GPU types, from current-gen B200 down to consumer 3060s, with the headline floor reflecting the cheapest available hosts at any given moment.
Three instance types serve different workloads. On-demand provides guaranteed uptime at the standard rate. Interruptible runs at 50%+ discount but the host can reclaim capacity. Reserved offers commitment-based discounts up to 50% off.
Researchers, hobbyists, and price-sensitive teams use Vast.ai for training runs and experimentation where uptime guarantees take a back seat to dollar cost. Production workloads generally avoid the marketplace, though the platform's own filters let buyers screen out lower-tier hosts at higher prices. Bandwidth charges apply per byte transferred and vary by host, which can materially affect total cost on data-intensive workloads.
Visit vast.ai→
zCLOUD
Aggregator · Sales-led · Global
zCLOUD aggregates capacity across a network of 40+ underlying GPU providers and presents it as a single procurement surface. The platform sits a layer above the providers in this table rather than alongside them, which makes the comparison structurally different from every other row.
Pricing uses a bid/ask system. Rather than publishing a rate card, zCLOUD prices each request against live availability from the underlying network at the time of purchase, then routes the workload to the best available match. Region constraints, hardware requirements, and term length all enter the routing logic.
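One way to picture bid/ask matching is as a filter followed by a minimum. The sketch below is a toy illustration, not zCLOUD's actual routing logic: the `Ask` and `Bid` types, the provider labels, and every price are hypothetical, and term length is left out for brevity.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class Ask:
    provider: str          # hypothetical provider label
    gpu: str
    region: str
    price_per_hour: float  # illustrative price, USD per GPU-hour
    gpus_available: int

@dataclass(frozen=True)
class Bid:
    gpu: str
    regions: frozenset     # regions the buyer will accept
    gpus_needed: int
    max_price_per_hour: float

def match(bid: Bid, asks: list[Ask]) -> Optional[Ask]:
    """Return the cheapest ask that satisfies every constraint, or None."""
    eligible = [
        a for a in asks
        if a.gpu == bid.gpu
        and a.region in bid.regions
        and a.gpus_available >= bid.gpus_needed
        and a.price_per_hour <= bid.max_price_per_hour
    ]
    return min(eligible, key=lambda a: a.price_per_hour, default=None)

asks = [
    Ask("provider-a", "H100", "us-east", 2.40, 8),
    Ask("provider-b", "H100", "eu-west", 2.10, 4),   # cheapest, but too few GPUs
    Ask("provider-c", "H100", "eu-west", 2.60, 16),
]
bid = Bid("H100", frozenset({"eu-west"}), gpus_needed=8, max_price_per_hour=3.00)

best = match(bid, asks)
print(best.provider if best else "no match")
```

In this toy data the cheapest offer is excluded because it cannot supply enough GPUs, which is the kind of constraint a flat rate card does not express.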
The fit covers teams whose requirements shift faster than a single provider's strengths can keep up with: training workloads that move between fine-tuning runs, inference traffic that spikes unpredictably, region constraints that turn into hard requirements mid-project, or a buyer who hasn't yet decided which single provider's strengths matter most.
Engagement runs through sales because the routing logic depends on workload specifics that a self-serve checkout can't fully capture. The trade for that procurement step is access to breadth that no single provider can match, including the eleven other providers on this page when they offer the right capacity at the right price.
The site that hosts this comparison is operated by Zettabyte Technology, which also operates zCLOUD. The footer carries the disclosure.
Visit zettabytecloud.com→