DeepSeek V3 MFU Calculator
Model Arguments
Global Batch Size:
Max Batch Size:
Max Sequence Length:
Sequence Length:
Data Type: bf16 or fp8
Vocabulary Size:
Model Dimension:
Intermediate Dimension:
MoE Intermediate Dimension:
Number of Layers:
Number of Dense Layers:
Number of Heads:
Number of MTP Modules:
Number of Routed Experts:
Number of Shared Experts:
Number of Activated Experts:
Number of Expert Groups:
Number of Limited Groups:
Q LoRA Rank:
KV LoRA Rank:
QK NoPE Head Dimension:
QK RoPE Head Dimension:
V Head Dimension:
Causal Mask: True or False
Calculation Parameters
Step Time (s):
World Size (number of GPUs):
GPU Peak BF16 FLOPS (TFLOPS):
Results
MFU:
-
Total FLOPS:
-
Calculation Formulas
Embedding Layer
embedding_flops = 2 * gbs * seq_len * dim * vocab_size
MLA (Multi-Head Latent Attention)
q_down_proj = 2 * gbs * seq_len * hidden_size * q_lora_rank
q_up_proj = 2 * gbs * seq_len * q_lora_rank * num_heads * qk_head_dim
q_linear = q_down_proj + q_up_proj
kv_down_proj = 2 * gbs * seq_len * hidden_size * (kv_lora_rank + qk_rope_head_dim)
kv_up_proj = 2 * gbs * seq_len * kv_lora_rank * num_heads * (qk_head_dim + v_head_dim)
kv_linear = kv_down_proj + kv_up_proj
// The two attention terms below are halved when causal_mask is True:
kv_scores = (2 * gbs * seq_len² * num_heads * qk_head_dim) / (causal_mask ? 2 : 1)
qkv = (2 * gbs * seq_len² * num_heads * v_head_dim) / (causal_mask ? 2 : 1)
out_linear = 2 * gbs * seq_len * num_heads * v_head_dim * hidden_size
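The MLA terms can be collected into one function, as in the minimal Python sketch below. Two points are assumptions rather than statements from this page: that qk_head_dim is the sum of the QK NoPE and QK RoPE head dimensions listed among the inputs, and that mla_layer_flops (used in the Total FLOPS section but not defined there) is the sum of all five terms.

```python
def mla_layer_flops(gbs, seq_len, hidden_size, num_heads,
                    q_lora_rank, kv_lora_rank,
                    qk_nope_head_dim, qk_rope_head_dim, v_head_dim,
                    causal_mask=True):
    """Forward FLOPs of one MLA layer, mirroring the formulas above."""
    # Assumption: qk_head_dim = NoPE part + RoPE part of each query/key head.
    qk_head_dim = qk_nope_head_dim + qk_rope_head_dim
    tokens = gbs * seq_len

    # Low-rank query projection: down to q_lora_rank, then up to all heads.
    q_down_proj = 2 * tokens * hidden_size * q_lora_rank
    q_up_proj = 2 * tokens * q_lora_rank * num_heads * qk_head_dim
    q_linear = q_down_proj + q_up_proj

    # Low-rank key/value projection.
    kv_down_proj = 2 * tokens * hidden_size * (kv_lora_rank + qk_rope_head_dim)
    kv_up_proj = 2 * tokens * kv_lora_rank * num_heads * (qk_head_dim + v_head_dim)
    kv_linear = kv_down_proj + kv_up_proj

    # Attention terms are quadratic in seq_len and halved under a causal mask.
    mask_divisor = 2 if causal_mask else 1
    kv_scores = 2 * gbs * seq_len ** 2 * num_heads * qk_head_dim // mask_divisor
    qkv = 2 * gbs * seq_len ** 2 * num_heads * v_head_dim // mask_divisor

    # Output projection back to the model dimension.
    out_linear = 2 * tokens * num_heads * v_head_dim * hidden_size

    # Assumption: the per-layer total is the sum of all five terms.
    return q_linear + kv_linear + kv_scores + qkv + out_linear
```

Because only kv_scores and qkv depend on seq_len², the causal mask setting matters more as the sequence length grows.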
MoE Layer
linear_layer_flops = 2 * 3 * gbs * seq_len * hidden_size * moe_inter_dim
route_flops = 2 * gbs * seq_len * hidden_size * n_routed_experts
moe_layer_flops = linear_layer_flops * (n_shared_experts + n_activated_experts) + route_flops
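A small Python sketch of the MoE formula follows. The reading of the constants is an assumption: the factor 3 is taken to mean three matrix multiplications per expert (for example gate, up and down projections) and the factor 2 to mean two FLOPs per multiply-add. The formula counts only experts that process a token (shared plus activated), while the router scores every routed expert.

```python
def moe_layer_flops(gbs, seq_len, hidden_size, moe_inter_dim,
                    n_routed_experts, n_shared_experts, n_activated_experts):
    """Forward FLOPs of one MoE layer, mirroring the formulas above."""
    tokens = gbs * seq_len

    # FLOPs for one expert applied to every token
    # (assumed: 3 matmuls, 2 FLOPs per multiply-add).
    linear_layer_flops = 2 * 3 * tokens * hidden_size * moe_inter_dim

    # Router: one score per routed expert for every token.
    route_flops = 2 * tokens * hidden_size * n_routed_experts

    # Each token passes through the shared experts plus its activated experts.
    experts_per_token = n_shared_experts + n_activated_experts
    return linear_layer_flops * experts_per_token + route_flops
```

Note that the cost scales with the number of activated experts per token, not with the total number of routed experts, which only enters through the routing term.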
MLP Layer
mlp_layer_flops = 2 * 3 * gbs * seq_len * hidden_size * inter_dim
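For completeness, the dense MLP formula as a Python function. The function name follows the mlp_layer_flops term used in the Total FLOPS section, and the constant names are hypothetical labels for an assumed interpretation (three matmuls, two FLOPs per multiply-add).

```python
FLOPS_PER_MULTIPLY_ADD = 2  # assumed meaning of the leading factor 2
MATMULS_PER_MLP = 3         # assumed meaning of the factor 3


def mlp_layer_flops(gbs, seq_len, hidden_size, inter_dim):
    """Forward FLOPs of one dense MLP layer, mirroring the formula above."""
    tokens = gbs * seq_len
    return FLOPS_PER_MULTIPLY_ADD * MATMULS_PER_MLP * tokens * hidden_size * inter_dim
```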
Total FLOPS
main_model_flops = 3 * (embedding_flops + moe_layers * (mla_layer_flops + moe_layer_flops) + n_dense_layers * (mla_layer_flops + mlp_layer_flops))
mtp_flops = 3 * (embedding_flops + mla_layer_flops + moe_layer_flops + linear_proj)
total_flops = main_model_flops + mtp_flops * n_mtp_modules
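The aggregation can be sketched in Python with the per-layer FLOPs passed in as plain numbers. Several things here are assumptions: moe_layers is taken to be the number of layers minus the number of dense layers, the factor 3 is read as forward plus backward pass (with the backward pass costing roughly twice the forward pass), and linear_proj, which this page does not define, is left as a caller-supplied value.

```python
def total_flops(embedding_flops, mla_layer_flops, moe_layer_flops,
                mlp_layer_flops, n_layers, n_dense_layers,
                n_mtp_modules, linear_proj=0):
    """Training FLOPs per step, mirroring the Total FLOPS formulas above."""
    # Assumption: every layer that is not dense is an MoE layer.
    moe_layers = n_layers - n_dense_layers

    # Assumed meaning of the factor 3: forward (1x) + backward (about 2x).
    fwd_bwd_factor = 3

    main_model_flops = fwd_bwd_factor * (
        embedding_flops
        + moe_layers * (mla_layer_flops + moe_layer_flops)
        + n_dense_layers * (mla_layer_flops + mlp_layer_flops)
    )

    # Each MTP module: embedding, one MLA + MoE block, and a projection.
    # linear_proj is not defined on this page, so it is an explicit input.
    mtp_flops = fwd_bwd_factor * (
        embedding_flops + mla_layer_flops + moe_layer_flops + linear_proj
    )

    return main_model_flops + mtp_flops * n_mtp_modules
```

Setting n_mtp_modules to 0 gives the main-model cost alone, which makes it easy to see how much the MTP modules add.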
MFU (Model FLOPS Utilization)
mfu = total_flops / (world_size * step_time * 10¹²) / gpu_peak_bf16_flops
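A minimal Python sketch of the MFU formula, with the unit conversion made explicit. The peak throughput is assumed to be entered in TFLOPS per GPU, as the input label suggests, which is why the achieved rate is divided by 10¹² before the comparison.

```python
def mfu(total_flops, world_size, step_time, gpu_peak_bf16_tflops):
    """Model FLOPS Utilization as a fraction between 0 and 1.

    total_flops: FLOPs needed for one training step
    world_size: number of GPUs
    step_time: measured seconds per step
    gpu_peak_bf16_tflops: peak BF16 throughput of one GPU, in TFLOPS
    """
    # Achieved throughput per GPU, converted from FLOPS to TFLOPS.
    achieved_tflops_per_gpu = total_flops / (world_size * step_time * 1e12)
    return achieved_tflops_per_gpu / gpu_peak_bf16_tflops


# Illustrative numbers only: 1e15 FLOPs per step on 2 GPUs in 1 second,
# against a hypothetical 1000 TFLOPS peak per GPU.
print(f"MFU: {mfu(1e15, 2, 1.0, 1000.0):.1%}")
```

Multiply the result by 100 to report it as a percentage.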