Floating-point

IEEE 754 encoding with sign, characteristic (Excess-q) and mantissa (fixed-point). Constant relative error.


How it works

The floating-point code c_GK,k,n encodes a real number in normalised form m * 2^e using 1 sign bit, n-k characteristic bits (Excess-q with q = 2^(n-k-1) - 1) and k-1 mantissa bits (fixed-point, with the leading 1 not stored). Bit patterns are reserved for zero, ±infinity, NaN and subnormals. The absolute rounding error grows with the exponent, but with round-to-nearest the relative error of normalised values stays bounded by 2^-k.
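
The field layout described above can be made concrete for binary32 (n = 32, k = 24, q = 127). The following is a minimal sketch, assuming that float is a 32-bit IEEE 754 binary32 on the target platform; the names fp32_fields, fp32_decode and fp32_exponent are hypothetical helpers introduced only for illustration.

```c
#include <stdint.h>
#include <string.h>

/* Bit fields of a binary32 value (n = 32, k = 24 in the notation above). */
typedef struct {
    uint32_t sign;           /* 1 bit */
    uint32_t characteristic; /* n - k = 8 bits, stored in Excess-q with q = 127 */
    uint32_t mantissa;       /* k - 1 = 23 bits, leading 1 not stored */
} fp32_fields;

/* Hypothetical helper: split a float into its three bit fields.
   Assumes float is IEEE 754 binary32 (not guaranteed by the C standard). */
fp32_fields fp32_decode(float x)
{
    uint32_t bits;
    memcpy(&bits, &x, sizeof bits); /* copy the bit pattern without aliasing issues */

    fp32_fields f;
    f.sign = bits >> 31;
    f.characteristic = (bits >> 23) & 0xFFu;
    f.mantissa = bits & 0x7FFFFFu;
    return f;
}

/* Exponent e = characteristic - q. Only meaningful for normalised values,
   i.e. when the characteristic is neither 0 nor 255 (reserved patterns). */
int fp32_exponent(fp32_fields f)
{
    return (int)f.characteristic - 127;
}
```

For example, -6.5f equals -1.625 * 2^2, so fp32_decode(-6.5f) should give sign 1, characteristic 129 (2 + 127) and mantissa 0x500000 (the fraction 0.625 in 23 bits).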

Rounding error

In floating-point encoding the absolute error grows with the exponent, but the relative error stays bounded by a constant, so normalised values of every magnitude are represented with the same number of significant bits.

Maximum absolute error: 2^e / 2^k (worst case at e = q = 1: 2^1 / 2^6 = 0.03125)
Maximum relative error: 1 / 2^k = 1 / 2^6 = 0.015625 (constant)
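
The two bounds above can be evaluated directly. The sketch below assumes round-to-nearest and normalised values; max_abs_error, max_rel_error and float_rel_error are hypothetical helper names, and float_rel_error additionally assumes that float is binary32 (k = 24).

```c
#include <math.h>

/* Maximum absolute rounding error 2^e / 2^k at exponent e with k significand bits. */
double max_abs_error(int e, int k)
{
    return ldexp(1.0, e - k); /* ldexp(1.0, p) computes 1.0 * 2^p */
}

/* Maximum relative rounding error 1 / 2^k; the exponent e does not appear. */
double max_rel_error(int k)
{
    return ldexp(1.0, -k);
}

/* Measured relative error when a nonzero double x is rounded to float.
   Assumes float is binary32 and x lies in the normalised float range. */
double float_rel_error(double x)
{
    double rounded = (double)(float)x;
    double diff = rounded > x ? rounded - x : x - rounded;
    return diff / (x < 0.0 ? -x : x);
}
```

With the values used above, max_abs_error(1, 6) is 0.03125 and max_rel_error(6) is 0.015625. For binary32, float_rel_error(0.1) and float_rel_error(1e10) should both stay at or below max_rel_error(24), although the two inputs differ by eleven orders of magnitude.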

When to use

In C, floating-point is typically provided as float (binary32, k=24, n=32) and double (binary64, k=53, n=64). Because of catastrophic cancellation and non-associative arithmetic, add numbers of similar magnitude together, for example by summing from the smallest to the largest magnitude. Always compare with a tolerance instead of ==.
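
Both recommendations can be sketched in a few lines of C. This is an illustration under stated assumptions, not a definitive implementation: the function names are hypothetical, the tolerances abs_tol and rel_tol must be chosen by the caller for the problem at hand, and sorting by magnitude is only one simple way to keep small terms from being absorbed by large ones.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

/* Magnitude of a float, written out to keep the sketch free of <math.h>. */
static float magnitude(float v)
{
    return v < 0.0f ? -v : v;
}

/* qsort comparator: ascending by magnitude. */
static int by_magnitude(const void *pa, const void *pb)
{
    float a = magnitude(*(const float *)pa);
    float b = magnitude(*(const float *)pb);
    return (a > b) - (a < b);
}

/* Left-to-right sum in the order given. */
float sum_in_order(const float *x, size_t n)
{
    float acc = 0.0f;
    for (size_t i = 0; i < n; ++i)
        acc += x[i];
    return acc;
}

/* Sorts x in place by ascending magnitude, then sums, so that small
   terms accumulate before they meet large ones. */
float sum_by_magnitude(float *x, size_t n)
{
    qsort(x, n, sizeof *x, by_magnitude);
    return sum_in_order(x, n);
}

/* Tolerance comparison: true if a and b differ by at most abs_tol,
   or by at most rel_tol relative to the larger magnitude. */
bool nearly_equal(double a, double b, double abs_tol, double rel_tol)
{
    double diff = a > b ? a - b : b - a;
    double ma = a < 0.0 ? -a : a;
    double mb = b < 0.0 ? -b : b;
    double scale = ma > mb ? ma : mb;
    return diff <= abs_tol || diff <= rel_tol * scale;
}
```

On a platform that evaluates float arithmetic in binary32, summing 1e8f followed by eight 1.0f values in order gives 100000000, because the spacing between floats near 1e8 is 8 and each added 1 is rounded away. sum_by_magnitude on the same array gives 100000008. Likewise 0.1 + 0.2 == 0.3 is false in binary64, while nearly_equal(0.1 + 0.2, 0.3, 1e-12, 1e-9) is true.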