Just finished benchmarking nvidia/llama-embed-nemotron-8b across 4 GPU generations (RTX 5090, RTX 4090, RTX 3090, and a $60 P102-100 mining card) and my first results were completely wrong.
Here is what happened.
I started with what seemed like a solid benchmark: padding to max_length=512, counting tokens via attention_mask.numel(), torch.compile with reduce-overhead. The dashboard showed 13,000 TPS. Impressive, right?
Except it was fiction.
What is nvidia/llama-embed-nemotron-8b?
nvidia/llama-embed-nemotron-8b is NVIDIA’s 8-billion-parameter text embedding model, built on the Llama architecture and released through the Nemotron family. It produces dense vectors for retrieval, RAG, and semantic search. Compared to closed APIs like OpenAI text-embedding-3-large or Gemini text-embedding-004, it runs locally on consumer GPUs (the 32 GB RTX 5090 fits it comfortably in bf16). Compared to open-weights peers like BAAI/bge-m3, it is a much larger model that trades latency for quality on long-form retrieval. This post is about how fast it actually runs, once you measure honestly.
The problems I found
- `numel()` counts padding tokens as real work. A 50-token text padded to 512 inflates your TPS by 10x.
- `padding="max_length"` forces the GPU to process 462 zeros per short text, burning compute on nothing.
- Compiling `average_pool` separately from the model caused CUDA Graph buffer conflicts under `reduce-overhead` mode.
- `torch.cuda.amp.autocast` wrapping an already-bf16 model adds dispatcher overhead on every operation.
- `DataLoader` with `num_workers>0` for string passthrough adds IPC cost with zero benefit.
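The two counting problems are easy to reproduce on a toy attention mask. The sketch below is a minimal illustration in plain Python, with nested lists standing in for the tensor. `pad_to`, `count_all_positions`, and `count_real_tokens` are hypothetical helper names, not part of any library. The first counter mirrors what `attention_mask.numel()` reports, the second what `attention_mask.sum()` reports.

```python
def pad_to(lengths, width):
    """Build a 0/1 attention mask (one row per text) for the given token lengths."""
    return [[1] * n + [0] * (width - n) for n in lengths]


def count_all_positions(mask):
    """Analogue of attention_mask.numel(): every slot, padding included."""
    return sum(len(row) for row in mask)


def count_real_tokens(mask):
    """Analogue of attention_mask.sum(): only slots that hold a real token."""
    return sum(sum(row) for row in mask)


lengths = [50, 50, 50, 50]                     # four short texts, 50 tokens each

static_mask = pad_to(lengths, 512)             # like padding="max_length", max_length=512
dynamic_mask = pad_to(lengths, max(lengths))   # pad only to the longest text in the batch

inflation = count_all_positions(static_mask) / count_real_tokens(static_mask)

print(f"reported tokens (numel-style): {count_all_positions(static_mask)}")
print(f"real tokens (sum-style):       {count_real_tokens(static_mask)}")
print(f"inflation factor:              {inflation:.2f}x")
print(f"slots with dynamic padding:    {count_all_positions(dynamic_mask)}")
```

With static padding the batch occupies 2,048 slots for 200 real tokens, which is where the roughly 10x inflation comes from. With dynamic padding the slot count equals the real token count for this batch.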
The honest benchmark
After fixing the methodology (dynamic padding, honest token counting via attention_mask.sum(), proper torch.compile warmup with cudagraph_mark_step_begin(), per-batch timing with p5/p95 percentiles), here is the real picture.
Nemotron-8B, bf16, dynamic padding, mean text length ~73 tokens:
| GPU | Compute capability | VRAM | TPS (median) | Texts/sec |
|---|---|---|---|---|
| RTX 5090 | SM_120 | 32 GB | 5,100 | ~137 |
| RTX 4090 | SM_89 | 24 GB | 3,400 | ~93 |
| RTX 3090 | SM_86 | 24 GB | 1,580 | ~43 |
| P102-100 (4-bit NF4) | SM_61 | 10 GB | 210 | ~4 |
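The median column comes from per-batch timing. A reduction of that kind can be sketched as follows. This is a minimal sketch, not the benchmark's actual code: `percentile` and `summarize_tps` are hypothetical helpers, and the sample numbers are made up for illustration. On a real GPU run, each batch would also need a `torch.cuda.synchronize()` before reading the clock, because CUDA kernels launch asynchronously.

```python
import statistics


def percentile(sorted_values, q):
    """Linear-interpolation percentile (q in [0, 100]) over an already sorted list."""
    if not sorted_values:
        raise ValueError("need at least one value")
    pos = (len(sorted_values) - 1) * q / 100.0
    lo = int(pos)
    hi = min(lo + 1, len(sorted_values) - 1)
    frac = pos - lo
    return sorted_values[lo] * (1 - frac) + sorted_values[hi] * frac


def summarize_tps(real_tokens_per_batch, seconds_per_batch):
    """Per-batch tokens/sec (real tokens only), reduced to median, p5, and p95."""
    tps = sorted(t / s for t, s in zip(real_tokens_per_batch, seconds_per_batch))
    return {
        "median": statistics.median(tps),
        "p5": percentile(tps, 5),
        "p95": percentile(tps, 95),
    }


# Made-up measurements for illustration only, not results from the benchmark.
tokens = [2300, 2400, 2350, 2500, 2450]    # real tokens per batch (sum of the mask)
seconds = [0.50, 0.50, 0.50, 0.50, 0.50]   # wall-clock time per batch

summary = summarize_tps(tokens, seconds)
print(summary)
```

Reporting the median with p5 and p95 keeps one slow warmup batch or one lucky batch from distorting the headline number.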
Key findings
- TPS plateaus at `batch_size=32` on all cards. Bigger batches only eat VRAM.
- Padding overhead is 2.6x across the board. Sorting texts by length before batching could push real throughput 30 to 40% higher.
- The 5090 is 3.2x faster than the 3090, not the 10x you would expect from spec sheets. Memory bandwidth is the bottleneck, not compute.
- P102-100 (a $60 mining card) actually runs an 8B model in 4-bit. 210 TPS is slow but functional for small-scale indexing.
- For 100M+ document embeddings, parallelizing across multiple mid-tier GPUs beats a single flagship. You get linear scaling plus free I/O parallelism.
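The length-sorting idea from the findings above can be checked on paper before touching the GPU. The sketch below is a toy model in plain Python: `padded_slots` and `padding_overhead` are hypothetical helpers, and the token lengths are invented. It assumes each batch is padded to its own longest text (dynamic padding).

```python
def padded_slots(lengths, batch_size):
    """Total slots processed when each batch is padded to its own longest text."""
    total = 0
    for start in range(0, len(lengths), batch_size):
        batch = lengths[start:start + batch_size]
        total += max(batch) * len(batch)
    return total


def padding_overhead(lengths, batch_size):
    """Processed slots divided by real tokens (1.0 means no padding at all)."""
    return padded_slots(lengths, batch_size) / sum(lengths)


# Toy corpus: token lengths in arrival order (hypothetical values).
lengths = [10, 200, 20, 180, 30, 160, 40, 140]

unsorted_overhead = padding_overhead(lengths, batch_size=2)
sorted_overhead = padding_overhead(sorted(lengths), batch_size=2)

print(f"arrival order: {unsorted_overhead:.2f}x")
print(f"length-sorted: {sorted_overhead:.2f}x")
```

In a real pipeline the original indices have to be kept alongside the sorted texts so the embeddings can be put back in arrival order afterwards.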
The naive benchmark told us we were fast. The honest benchmark told us where to actually optimize.
Curious to hear from the NVIDIA NeMo team: what is the recommended inference setup for Nemotron-8B at scale (not the A100 setup)? Any plans for TensorRT-LLM optimization for embedding workloads?
FAQ
What is nvidia/llama-embed-nemotron-8b?
It is NVIDIA’s 8B-parameter text embedding model from the Nemotron family, built on the Llama architecture. It outputs dense vectors used for retrieval, RAG pipelines, and semantic search. It is an open-weights alternative to closed APIs like OpenAI text-embedding-3 and Google Gemini text-embedding-004, and a larger, higher-quality counterpart to BAAI/bge-m3.
Can I run Nemotron 8B locally on a consumer GPU?
Yes. The RTX 5090 (32 GB) runs it comfortably in bf16 at ~5,100 tokens per second. The RTX 4090 and 3090 (24 GB) also work in bf16. On 10 to 12 GB cards you need 4-bit NF4 quantization (the P102-100 result above shows it works, just slowly).
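A rough weight-memory estimate shows why these VRAM tiers line up the way they do. This is a back-of-the-envelope sketch: `weight_gib` is a hypothetical helper, the parameter count is assumed to be a round 8 billion, and activations, CUDA context, and quantization metadata are deliberately left out.

```python
def weight_gib(n_params, bits_per_param):
    """Memory for the weights alone, in GiB."""
    return n_params * bits_per_param / 8 / 2**30


N_PARAMS = 8e9  # assumption: a round 8 billion; the exact count may differ

bf16_gib = weight_gib(N_PARAMS, 16)   # 2 bytes per parameter
nf4_gib = weight_gib(N_PARAMS, 4)     # half a byte per parameter, excluding quantization constants

print(f"bf16 weights:  {bf16_gib:.1f} GiB")
print(f"4-bit weights: {nf4_gib:.1f} GiB")
```

Under these assumptions the bf16 weights come to about 15 GiB, which fits in 24 GB with room for activations but not in 10 to 12 GB. The 4-bit weights come to under 4 GiB.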
Why was my first benchmark reporting 13,000 TPS when the real number is ~5,000?
Three problems stack: counting padding tokens with attention_mask.numel() instead of attention_mask.sum(), using static padding="max_length" instead of dynamic padding, and a broken torch.compile setup where average_pool was compiled separately from the model under reduce-overhead. The first two inflate the reported number without changing the actual work done: about 2.5x overall here (13,000 vs ~5,100 TPS), and up to 10x for a single 50-token text padded to 512. The third caused CUDA Graph buffer conflicts.
How does the RTX 5090 compare to the RTX 4090 for LLM embedding inference?
For Nemotron-8B in bf16, the 5090 hits ~5,100 TPS median vs the 4090’s ~3,400 TPS, a 1.5x speedup. The 5090 vs 3090 gap is 3.2x. The gap is dominated by memory bandwidth, not raw compute. The 32 GB of VRAM on the 5090 also lets you keep the full bf16 model and larger batches resident without offloading.
Is it better to buy one RTX 5090 or two RTX 4090s for embedding workloads?
For embedding indexing at scale (100M+ documents), two RTX 4090s beat one RTX 5090: if throughput scales linearly, that is 2 x 3,400 = 6,800 TPS against 5,100 TPS, with parallel I/O for free and a lower cost per token. For interactive single-request latency, one 5090 is better because there is no sharding overhead.
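To put the comparison in wall-clock terms, here is a rough estimate for a 100M-document job. It is a sketch under stated assumptions: `hours_to_embed` is a hypothetical helper, throughput is assumed to scale linearly with GPU count, every document is assumed to be the benchmark's mean length of 73 tokens, and the TPS values are the medians from the table.

```python
def hours_to_embed(n_docs, tokens_per_doc, tps_per_gpu, n_gpus=1):
    """Wall-clock hours, assuming throughput scales linearly with GPU count."""
    total_tokens = n_docs * tokens_per_doc
    return total_tokens / (tps_per_gpu * n_gpus) / 3600


N_DOCS = 100_000_000
TOKENS_PER_DOC = 73   # mean text length from the benchmark

one_5090 = hours_to_embed(N_DOCS, TOKENS_PER_DOC, 5_100)
two_4090 = hours_to_embed(N_DOCS, TOKENS_PER_DOC, 3_400, n_gpus=2)

print(f"1x RTX 5090: {one_5090:.0f} h")
print(f"2x RTX 4090: {two_4090:.0f} h")
```

Under these assumptions the job takes roughly 398 hours on one 5090 and roughly 298 hours on two 4090s. Real corpora with longer documents or imperfect scaling will shift both numbers.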
Can a P102-100 mining card really run an 8B embedding model?
Yes, in 4-bit NF4 quantization. Throughput is ~210 TPS, about 24x slower than an RTX 5090. It is too slow for anything interactive, but it does work for small-scale offline indexing if you already own the card. The Pascal architecture (SM_61) is the actual ceiling: no native bf16, no Flash Attention.
Why does TPS plateau at batch_size=32?
By batch 32 the matrix multiplications inside each transformer layer already keep the GPU busy. Larger batches do not feed it more useful parallel work, they just hold more activations in VRAM. From there, throughput is limited by memory bandwidth, which is why the 5090’s 1792 GB/s (vs 936 GB/s on the 3090, about 1.9x) accounts for a large part of the 3.2x throughput gap.
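A quick ratio check shows how much of the 5090 vs 3090 gap the bandwidth figures cover. This only rearranges numbers already quoted in this post; `residual` is simply a name for whatever the bandwidth ratio leaves unexplained.

```python
# Figures quoted in this post.
bandwidth_gbs = {"RTX 5090": 1792, "RTX 3090": 936}
median_tps = {"RTX 5090": 5100, "RTX 3090": 1580}

bw_ratio = bandwidth_gbs["RTX 5090"] / bandwidth_gbs["RTX 3090"]
tps_ratio = median_tps["RTX 5090"] / median_tps["RTX 3090"]
residual = tps_ratio / bw_ratio   # part of the gap that bandwidth alone does not cover

print(f"bandwidth ratio:  {bw_ratio:.2f}x")
print(f"throughput ratio: {tps_ratio:.2f}x")
print(f"residual factor:  {residual:.2f}x")
```

Bandwidth accounts for about 1.9x of the 3.2x gap. The remaining factor of roughly 1.7x has to come from something else, and this sketch does not say what.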