Skip to content

Stereo Depth

Category: Lifting (2D → 3D)
Experimental: No

Stereo depth estimation recovers per-pixel metric depth (in metres) from a pair of rectified left/right RGB images by matching corresponding pixels across the two views and applying the stereo geometry formula:

depth_m = baseline_mm × focal_length_px / disparity_px / 1000

vizion3d uses S2M2 (Stereo Matching Model with Multi-scale transformer) as its stereo backend. Unlike Depth Estimation, stereo depth produces real-world metric distances — provided the camera calibration parameters are correct.


Model backends

Default checkpoint download: stereo-depth-s2m2-L.pth

curl -L \
  https://github.com/OlafenwaMoses/vizion3D/releases/download/essentials-v1/stereo-depth-s2m2-L.pth \
  -o stereo-depth-s2m2-L.pth
Value What happens
(default) Downloads the vizion3D release checkpoint (stereo-depth-s2m2-L.pth, the L variant) to ~/.cache/vizion3d/models/ on first use, then loads it
An HTTPS URL ending in .pth or .pt Downloaded to the cache directory on first use, then loaded as an S2M2 checkpoint
A local .pth or .pt file path Loaded directly — no download

Models are kept in memory after the first inference. Set VIZION3D_MODEL_CACHE to override the cache directory.

S2M2 variants

The S2M2 architecture comes in four size variants. The correct one is detected automatically from the checkpoint filename:

Variant Channels Transformers Speed Quality
S (-S.pth) 128 1 Fastest Good
M (-M.pth) 192 2 Fast Better
L (-L.pth) 256 3 Balanced Best (default)
XL (-XL.pth) 384 3 Slowest Best

Command parameters

StereoDepthCommand is the input contract for this task.

Parameter Type Required Default Description
left_image str \| bytes Yes Left-camera image. Pass a file path string or raw image bytes.
right_image str \| bytes Yes Right-camera image (same resolution, horizontally offset from left_image).
model_backend str No vizion3D release checkpoint URL S2M2 checkpoint. See Model backends above.
return_depth_image bool No False If True, the result includes a 16-bit grayscale Open3D Image of the depth map.
return_point_cloud bool No False If True, the result includes an Open3D PointCloud in metres.
advanced_config StereoDepthAdvancedConfig No 1280×720 @ 100 mm baseline defaults Camera intrinsics and inference settings. See Advanced config below.

Result fields

StereoDepthResult is the output contract.

Field Type Always present Description
depth_map list[list[float]] Yes Metric depth in metres, shape [H][W]. Real-world distances (assuming correct calibration).
disparity_map list[list[float]] Yes Raw disparity in pixels, shape [H][W]. Horizontal pixel offset between matched features.
min_depth float Yes Minimum value in depth_map (metres).
max_depth float Yes Maximum value in depth_map. Guaranteed max_depth >= min_depth.
backend_used str Yes Resolved local file path of the checkpoint used.
depth_image open3d.geometry.Image \| None When return_depth_image=True 16-bit grayscale image, dtype uint16. The full 0–65535 range maps to [min_depth, max_depth] in metres.
point_cloud open3d.geometry.PointCloud \| None When return_point_cloud=True Coloured 3D point cloud, coordinates in metres.
point_cloud_scale float Yes Always 1.0 — stereo depth produces real metric coordinates.

1. Direct Python import — image bytes

from vizion3d.stereo import StereoDepth, StereoDepthCommand

with open("left.png", "rb") as f:
    left_bytes = f.read()
with open("right.png", "rb") as f:
    right_bytes = f.read()

cmd = StereoDepthCommand(left_image=left_bytes, right_image=right_bytes)
result = StereoDepth().run(cmd)

print(f"Depth range : {result.min_depth:.2f} → {result.max_depth:.2f} m")
print(f"Backend     : {result.backend_used}")

2. Direct Python import — file paths

from vizion3d.stereo import StereoDepth, StereoDepthCommand

cmd = StereoDepthCommand(
    left_image="left.png",
    right_image="right.png",
)
result = StereoDepth().run(cmd)

print(f"Depth range: {result.min_depth:.2f} → {result.max_depth:.2f} m")

3. Disparity map

The raw disparity map (in pixels) is always returned alongside the depth map.

import numpy as np
from vizion3d.stereo import StereoDepth, StereoDepthCommand

cmd = StereoDepthCommand(left_image="left.png", right_image="right.png")
result = StereoDepth().run(cmd)

disp = np.array(result.disparity_map)
print(f"Disparity range: {disp.min():.1f} → {disp.max():.1f} px")

4. Depth image (16-bit PNG)

import numpy as np
from PIL import Image as PILImage
from vizion3d.stereo import StereoDepth, StereoDepthCommand

cmd = StereoDepthCommand(
    left_image="left.png",
    right_image="right.png",
    return_depth_image=True,
)
result = StereoDepth().run(cmd)

depth_array = np.asarray(result.depth_image)   # shape (H, W), dtype uint16
PILImage.fromarray(depth_array).save("depth.png")

5. Point cloud

Point coordinates are in real metrespoint_cloud_scale is always 1.0.

import numpy as np
import open3d as o3d
from vizion3d.stereo import StereoDepth, StereoDepthAdvancedConfig, StereoDepthCommand

cmd = StereoDepthCommand(
    left_image="left.png",
    right_image="right.png",
    return_point_cloud=True,
    advanced_config=StereoDepthAdvancedConfig(
        focal_length=1733.74,
        cx=792.27,
        cy=541.89,
        baseline=536.62,   # mm
    ),
)
result = StereoDepth().run(cmd)

pcd = result.point_cloud
points = np.asarray(pcd.points)               # shape (N, 3), metres
print(f"Points: {len(points):,}")
print(f"Scale : {result.point_cloud_scale} m/unit")  # always 1.0

# Real-world distance between two points
dist = np.linalg.norm(points[0] - points[1]) * result.point_cloud_scale
print(f"p0→p1: {dist:.4f} m")

o3d.io.write_point_cloud("scene.ply", pcd)

6. All outputs at once

import numpy as np
import open3d as o3d
from vizion3d.stereo import StereoDepth, StereoDepthCommand

cmd = StereoDepthCommand(
    left_image="left.png",
    right_image="right.png",
    return_depth_image=True,
    return_point_cloud=True,
)
result = StereoDepth().run(cmd)

print(f"Depth range : {result.min_depth:.2f} → {result.max_depth:.2f} m")
depth_arr = np.asarray(result.depth_image)    # uint16 (H, W)
o3d.io.write_point_cloud("scene.ply", result.point_cloud)

7. Speed vs quality: scale factor

Use scale_factor < 1.0 to downsample input before inference for faster results:

from vizion3d.stereo import StereoDepth, StereoDepthAdvancedConfig, StereoDepthCommand

cmd = StereoDepthCommand(
    left_image="left.png",
    right_image="right.png",
    advanced_config=StereoDepthAdvancedConfig(
        scale_factor=0.5,   # half-resolution → ~3–4× faster
    ),
)
result = StereoDepth().run(cmd)

8. REST API

Start the server with all REST features enabled:

uv run vizion3d-serve-rest

To preload the stereo checkpoint into memory at startup, pass --stereo_model. This also enables the stereo-depth endpoint. If this flag is omitted, the default vizion3D release model is downloaded on first inference and cached under ~/.cache/vizion3d/models/.

uv run vizion3d-serve-rest \
  --stereo_model /models/stereo-depth-s2m2-L.pth

The REST server can expose only selected features. If none of --depth_estimation, --stereo_depth, --depth_model, or --stereo_model is provided, all features are enabled. If any of those flags is provided, only the selected features are enabled. A model path flag selects and preloads its feature:

# Only POST /lifting/stereo-depth
uv run vizion3d-serve-rest --stereo_depth

# Only stereo depth, with the model loaded before the first request
uv run vizion3d-serve-rest \
  --stereo_depth \
  --stereo_model /models/stereo-depth-s2m2-L.pth

# Enable both depth estimation and stereo depth explicitly
uv run vizion3d-serve-rest \
  --depth_estimation \
  --stereo_depth \
  --depth_model /models/depth_anything_v2_vitb.pth \
  --stereo_model /models/stereo-depth-s2m2-L.pth

Send a request with two image files:

curl -X POST "http://localhost:8000/lifting/stereo-depth" \
  -F "left_image=@left.png" \
  -F "right_image=@right.png" \
  -F "focal_length=1733.74" \
  -F "baseline=536.62" \
  -F "cx=792.27" \
  -F "cy=541.89" \
  -F "return_point_cloud=true"

The response is a JSON-serialised StereoDepthResult. Binary fields (depth_image, point_cloud_ply) are base64-encoded.


9. gRPC API

Start the server:

uv run vizion3d-serve-grpc

Call from a gRPC client:

import grpc
from vizion3d.proto import lifting_pb2, lifting_pb2_grpc

channel = grpc.insecure_channel("localhost:50051")
stub = lifting_pb2_grpc.LiftingServiceStub(channel)

with open("left.png", "rb") as f:
    left_bytes = f.read()
with open("right.png", "rb") as f:
    right_bytes = f.read()

request = lifting_pb2.StereoDepthRequest(
    left_image_bytes=left_bytes,
    right_image_bytes=right_bytes,
    return_point_cloud=True,
    advanced_config=lifting_pb2.StereoDepthAdvancedConfig(
        focal_length=1733.74,
        baseline=536.62,
        cx=792.27,
        cy=541.89,
    ),
)
response = stub.RunStereoDepth(request)
print(f"Min depth : {response.min_depth:.2f} m")
print(f"Max depth : {response.max_depth:.2f} m")
print(f"Backend   : {response.backend_used}")

Advanced config

StereoDepthAdvancedConfig supplies the camera calibration needed for accurate metric depth.

Field Type Default Description
focal_length float 1000.0 Focal length in pixels (assumes fx = fy). Override with your calibration.
cx float 640.0 Principal point x (pixel column of optical axis).
cy float 360.0 Principal point y (pixel row of optical axis).
baseline float 100.0 Stereo baseline in millimetres.
doffs float 0.0 Disparity offset (non-zero for Middlebury-style calibration).
z_far float 10.0 Max depth in metres for point cloud.
conf_threshold float 0.1 Min per-pixel confidence score for point cloud inclusion.
occ_threshold float 0.5 Min occlusion score for point cloud inclusion.
scale_factor float 1.0 Input downscale factor (0.5 = half-res, ~3–4× faster).

How to obtain camera intrinsics

From a calibration file (e.g. Middlebury):

# calib.txt format: cam0=[fx 0 cx; 0 fy cy; 0 0 1]
# baseline=B (mm), doffs=d
from vizion3d.stereo import StereoDepthAdvancedConfig

cfg = StereoDepthAdvancedConfig(
    focal_length=1733.74,   # from calib.txt
    cx=792.27,
    cy=541.89,
    baseline=536.62,        # B in mm
    doffs=0.0,              # d from calib.txt
)

From Intel RealSense SDK:

import pyrealsense2 as rs

pipeline = rs.pipeline()
profile = pipeline.start()
left_stream = profile.get_stream(rs.stream.infrared, 1)
intrinsics = left_stream.as_video_stream_profile().get_intrinsics()

cfg = StereoDepthAdvancedConfig(
    focal_length=intrinsics.fx,
    cx=intrinsics.ppx,
    cy=intrinsics.ppy,
    baseline=50.0,   # RealSense D435 baseline ≈ 50 mm
)

Approximation from field of view:

import math

hfov_deg = 90.0  # horizontal FOV from camera spec
image_width = 1280
focal_length = image_width / (2 * math.tan(math.radians(hfov_deg / 2)))

cfg = StereoDepthAdvancedConfig(
    focal_length=focal_length,
    cx=image_width / 2 - 0.5,
    cy=720 / 2 - 0.5,
    baseline=100.0,
)

Known limitations

  • Rectified pairs required — images must be stereo-rectified so corresponding points lie on the same horizontal scanline. Un-rectified pairs will produce incorrect results.
  • Metric scale depends on calibration — an incorrect baseline or focal_length scales all depth values uniformly. Always use calibrated values for real applications.
  • Python 3.12 required for Open3Dreturn_depth_image and return_point_cloud require Open3D, which currently only supports Python 3.12 in this project.