Stereo Depth
Category: Lifting (2D → 3D)
Experimental: No
Stereo depth estimation recovers per-pixel metric depth (in metres) from a pair of rectified left/right RGB images by matching corresponding pixels across the two views and applying the stereo geometry formula:
depth_m = baseline_mm × focal_length_px / disparity_px / 1000
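A minimal sketch of this formula in plain Python (not the library's implementation):

```python
def disparity_to_depth_m(disparity_px: float, baseline_mm: float, focal_length_px: float) -> float:
    """Convert a per-pixel disparity (px) to metric depth (m)."""
    return baseline_mm * focal_length_px / disparity_px / 1000

# With a 100 mm baseline and a 1000 px focal length,
# a 20 px disparity corresponds to a point 5 m away.
print(disparity_to_depth_m(20.0, 100.0, 1000.0))  # 5.0
```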
vizion3d uses S2M2 (Stereo Matching Model with Multi-scale transformer) as its stereo backend. Unlike Depth Estimation, stereo depth produces real-world metric distances — provided the camera calibration parameters are correct.
Model backends
Default checkpoint download: stereo-depth-s2m2-L.pth
```shell
curl -L \
  https://github.com/OlafenwaMoses/vizion3D/releases/download/essentials-v1/stereo-depth-s2m2-L.pth \
  -o stereo-depth-s2m2-L.pth
```
| Value | What happens |
|---|---|
| (default) | Downloads the vizion3D release checkpoint (`stereo-depth-s2m2-L.pth`, the L variant) to `~/.cache/vizion3d/models/` on first use, then loads it |
| An HTTPS URL ending in `.pth` or `.pt` | Downloaded to the cache directory on first use, then loaded as an S2M2 checkpoint |
| A local `.pth` or `.pt` file path | Loaded directly; no download |
Models are kept in memory after the first inference. Set VIZION3D_MODEL_CACHE to override the cache directory.
S2M2 variants
The S2M2 architecture comes in four size variants. The correct one is detected automatically from the checkpoint filename:
| Variant | Channels | Transformers | Speed | Quality |
|---|---|---|---|---|
| S (`-S.pth`) | 128 | 1 | Fastest | Good |
| M (`-M.pth`) | 192 | 2 | Fast | Better |
| L (`-L.pth`) | 256 | 3 | Balanced | Best (default) |
| XL (`-XL.pth`) | 384 | 3 | Slowest | Best |
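A sketch of the filename-based detection rule implied by the table above (the library's actual logic may differ):

```python
def s2m2_variant(filename: str) -> str:
    """Return the S2M2 variant suffix encoded in a checkpoint filename."""
    stem = filename.rsplit(".", 1)[0]
    for suffix in ("XL", "S", "M", "L"):  # try XL before L so "-XL" isn't read as "-L"
        if stem.endswith("-" + suffix):
            return suffix
    raise ValueError(f"no S2M2 variant suffix in {filename!r}")

print(s2m2_variant("stereo-depth-s2m2-L.pth"))   # L
print(s2m2_variant("stereo-depth-s2m2-XL.pth"))  # XL
```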
Command parameters
StereoDepthCommand is the input contract for this task.
| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| `left_image` | `str \| bytes` | Yes | — | Left-camera image. Pass a file path string or raw image bytes. |
| `right_image` | `str \| bytes` | Yes | — | Right-camera image (same resolution, horizontally offset from `left_image`). |
| `model_backend` | `str` | No | vizion3D release checkpoint URL | S2M2 checkpoint. See Model backends above. |
| `return_depth_image` | `bool` | No | `False` | If `True`, the result includes a 16-bit grayscale Open3D `Image` of the depth map. |
| `return_point_cloud` | `bool` | No | `False` | If `True`, the result includes an Open3D `PointCloud` in metres. |
| `advanced_config` | `StereoDepthAdvancedConfig` | No | 1280×720 @ 100 mm baseline defaults | Camera intrinsics and inference settings. See Advanced config below. |
Result fields
StereoDepthResult is the output contract.
| Field | Type | Always present | Description |
|---|---|---|---|
| `depth_map` | `list[list[float]]` | Yes | Metric depth in metres, shape `[H][W]`. Real-world distances (assuming correct calibration). |
| `disparity_map` | `list[list[float]]` | Yes | Raw disparity in pixels, shape `[H][W]`. Horizontal pixel offset between matched features. |
| `min_depth` | `float` | Yes | Minimum value in `depth_map` (metres). |
| `max_depth` | `float` | Yes | Maximum value in `depth_map` (metres). Guaranteed `max_depth >= min_depth`. |
| `backend_used` | `str` | Yes | Resolved local file path of the checkpoint used. |
| `depth_image` | `open3d.geometry.Image \| None` | When `return_depth_image=True` | 16-bit grayscale image, dtype `uint16`. The full 0–65535 range maps to `[min_depth, max_depth]` in metres. |
| `point_cloud` | `open3d.geometry.PointCloud \| None` | When `return_point_cloud=True` | Coloured 3D point cloud, coordinates in metres. |
| `point_cloud_scale` | `float` | Yes | Always `1.0`; stereo depth produces real metric coordinates. |
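Since the 16-bit depth image maps 0–65535 linearly onto `[min_depth, max_depth]`, metres can be recovered from the raw array by inverting that mapping. A minimal numpy sketch:

```python
import numpy as np

def depth_image_to_metres(depth_u16: np.ndarray, min_depth: float, max_depth: float) -> np.ndarray:
    """Invert the linear 0-65535 -> [min_depth, max_depth] mapping of depth_image."""
    return min_depth + depth_u16.astype(np.float64) / 65535.0 * (max_depth - min_depth)

# The endpoints map back to min_depth and max_depth exactly.
img = np.array([[0, 65535]], dtype=np.uint16)
print(depth_image_to_metres(img, 0.5, 10.0))  # [[ 0.5 10. ]]
```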
1. Direct Python import — image bytes
```python
from vizion3d.stereo import StereoDepth, StereoDepthCommand

with open("left.png", "rb") as f:
    left_bytes = f.read()
with open("right.png", "rb") as f:
    right_bytes = f.read()

cmd = StereoDepthCommand(left_image=left_bytes, right_image=right_bytes)
result = StereoDepth().run(cmd)

print(f"Depth range : {result.min_depth:.2f} → {result.max_depth:.2f} m")
print(f"Backend     : {result.backend_used}")
```
2. Direct Python import — file paths
```python
from vizion3d.stereo import StereoDepth, StereoDepthCommand

cmd = StereoDepthCommand(
    left_image="left.png",
    right_image="right.png",
)
result = StereoDepth().run(cmd)

print(f"Depth range: {result.min_depth:.2f} → {result.max_depth:.2f} m")
```
3. Disparity map
The raw disparity map (in pixels) is always returned alongside the depth map.
```python
import numpy as np

from vizion3d.stereo import StereoDepth, StereoDepthCommand

cmd = StereoDepthCommand(left_image="left.png", right_image="right.png")
result = StereoDepth().run(cmd)

disp = np.array(result.disparity_map)
print(f"Disparity range: {disp.min():.1f} → {disp.max():.1f} px")
```
4. Depth image (16-bit PNG)
```python
import numpy as np
from PIL import Image as PILImage

from vizion3d.stereo import StereoDepth, StereoDepthCommand

cmd = StereoDepthCommand(
    left_image="left.png",
    right_image="right.png",
    return_depth_image=True,
)
result = StereoDepth().run(cmd)

depth_array = np.asarray(result.depth_image)  # shape (H, W), dtype uint16
PILImage.fromarray(depth_array).save("depth.png")
```
5. Point cloud
Point coordinates are in real metres — point_cloud_scale is always 1.0.
```python
import numpy as np
import open3d as o3d

from vizion3d.stereo import StereoDepth, StereoDepthAdvancedConfig, StereoDepthCommand

cmd = StereoDepthCommand(
    left_image="left.png",
    right_image="right.png",
    return_point_cloud=True,
    advanced_config=StereoDepthAdvancedConfig(
        focal_length=1733.74,
        cx=792.27,
        cy=541.89,
        baseline=536.62,  # mm
    ),
)
result = StereoDepth().run(cmd)

pcd = result.point_cloud
points = np.asarray(pcd.points)  # shape (N, 3), metres
print(f"Points: {len(points):,}")
print(f"Scale : {result.point_cloud_scale} m/unit")  # always 1.0

# Real-world distance between two points
dist = np.linalg.norm(points[0] - points[1]) * result.point_cloud_scale
print(f"p0→p1: {dist:.4f} m")

o3d.io.write_point_cloud("scene.ply", pcd)
```
6. All outputs at once
```python
import numpy as np
import open3d as o3d

from vizion3d.stereo import StereoDepth, StereoDepthCommand

cmd = StereoDepthCommand(
    left_image="left.png",
    right_image="right.png",
    return_depth_image=True,
    return_point_cloud=True,
)
result = StereoDepth().run(cmd)

print(f"Depth range : {result.min_depth:.2f} → {result.max_depth:.2f} m")
depth_arr = np.asarray(result.depth_image)  # uint16 (H, W)
o3d.io.write_point_cloud("scene.ply", result.point_cloud)
```
7. Speed vs quality: scale factor
Use scale_factor < 1.0 to downsample input before inference for faster results:
```python
from vizion3d.stereo import StereoDepth, StereoDepthAdvancedConfig, StereoDepthCommand

cmd = StereoDepthCommand(
    left_image="left.png",
    right_image="right.png",
    advanced_config=StereoDepthAdvancedConfig(
        scale_factor=0.5,  # half-resolution → ~3–4× faster
    ),
)
result = StereoDepth().run(cmd)
```
8. REST API
Start the server with all REST features enabled:
```shell
uv run vizion3d-serve-rest
```
To preload the stereo checkpoint into memory at startup, pass --stereo_model.
This also enables the stereo-depth endpoint. If this flag is omitted, the
default vizion3D release model is downloaded on first inference and cached under
~/.cache/vizion3d/models/.
```shell
uv run vizion3d-serve-rest \
    --stereo_model /models/stereo-depth-s2m2-L.pth
```
The REST server can expose only selected features. If none of
--depth_estimation, --stereo_depth, --depth_model, or --stereo_model is
provided, all features are enabled. If any of those flags is provided, only the
selected features are enabled. A model path flag selects and preloads its
feature:
```shell
# Only POST /lifting/stereo-depth
uv run vizion3d-serve-rest --stereo_depth

# Only stereo depth, with the model loaded before the first request
uv run vizion3d-serve-rest \
    --stereo_depth \
    --stereo_model /models/stereo-depth-s2m2-L.pth

# Enable both depth estimation and stereo depth explicitly
uv run vizion3d-serve-rest \
    --depth_estimation \
    --stereo_depth \
    --depth_model /models/depth_anything_v2_vitb.pth \
    --stereo_model /models/stereo-depth-s2m2-L.pth
```
Send a request with two image files:
```shell
curl -X POST "http://localhost:8000/lifting/stereo-depth" \
  -F "left_image=@left.png" \
  -F "right_image=@right.png" \
  -F "focal_length=1733.74" \
  -F "baseline=536.62" \
  -F "cx=792.27" \
  -F "cy=541.89" \
  -F "return_point_cloud=true"
```
The response is a JSON-serialised StereoDepthResult. Binary fields (depth_image, point_cloud_ply) are base64-encoded.
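Those base64 fields can be decoded client-side with the standard library. A sketch, using a simulated response body (the field names follow the description above; the PLY payload here is a stand-in, not real server output):

```python
import base64
import json

# Simulated JSON response from POST /lifting/stereo-depth.
body = json.dumps({
    "min_depth": 0.52,
    "max_depth": 9.87,
    "point_cloud_ply": base64.b64encode(b"ply\nformat ascii 1.0\nend_header\n").decode(),
})

resp = json.loads(body)
ply_bytes = base64.b64decode(resp["point_cloud_ply"])
with open("scene.ply", "wb") as f:
    f.write(ply_bytes)
print(ply_bytes[:3])  # b'ply'
```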
9. gRPC API
Start the server:
```shell
uv run vizion3d-serve-grpc
```
Call from a gRPC client:
```python
import grpc

from vizion3d.proto import lifting_pb2, lifting_pb2_grpc

channel = grpc.insecure_channel("localhost:50051")
stub = lifting_pb2_grpc.LiftingServiceStub(channel)

with open("left.png", "rb") as f:
    left_bytes = f.read()
with open("right.png", "rb") as f:
    right_bytes = f.read()

request = lifting_pb2.StereoDepthRequest(
    left_image_bytes=left_bytes,
    right_image_bytes=right_bytes,
    return_point_cloud=True,
    advanced_config=lifting_pb2.StereoDepthAdvancedConfig(
        focal_length=1733.74,
        baseline=536.62,
        cx=792.27,
        cy=541.89,
    ),
)
response = stub.RunStereoDepth(request)

print(f"Min depth : {response.min_depth:.2f} m")
print(f"Max depth : {response.max_depth:.2f} m")
print(f"Backend   : {response.backend_used}")
```
Advanced config
StereoDepthAdvancedConfig supplies the camera calibration needed for accurate metric depth.
| Field | Type | Default | Description |
|---|---|---|---|
| `focal_length` | `float` | `1000.0` | Focal length in pixels (assumes fx = fy). Override with your calibration. |
| `cx` | `float` | `640.0` | Principal point x (pixel column of the optical axis). |
| `cy` | `float` | `360.0` | Principal point y (pixel row of the optical axis). |
| `baseline` | `float` | `100.0` | Stereo baseline in millimetres. |
| `doffs` | `float` | `0.0` | Disparity offset (non-zero for Middlebury-style calibration). |
| `z_far` | `float` | `10.0` | Maximum depth in metres for the point cloud. |
| `conf_threshold` | `float` | `0.1` | Minimum per-pixel confidence score for point cloud inclusion. |
| `occ_threshold` | `float` | `0.5` | Minimum occlusion score for point cloud inclusion. |
| `scale_factor` | `float` | `1.0` | Input downscale factor (0.5 = half resolution, ~3–4× faster). |
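A non-zero `doffs` typically enters the triangulation as an additive disparity offset. The sketch below follows the Middlebury convention (Z = baseline · f / (d + doffs)); that the library applies `doffs` exactly this way is an assumption, so verify against your calibration file:

```python
def disparity_to_depth_m(disparity_px, baseline_mm, focal_length_px, doffs=0.0):
    # Middlebury convention: add the disparity offset before triangulating.
    return baseline_mm * focal_length_px / (disparity_px + doffs) / 1000

# With doffs = 0 this reduces to the stereo geometry formula from the top of the page.
print(disparity_to_depth_m(20.0, 100.0, 1000.0))       # 5.0
print(disparity_to_depth_m(20.0, 100.0, 1000.0, 5.0))  # 4.0
```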
How to obtain camera intrinsics
From a calibration file (e.g. Middlebury):
```python
# calib.txt format: cam0=[fx 0 cx; 0 fy cy; 0 0 1]
# baseline=B (mm), doffs=d
from vizion3d.stereo import StereoDepthAdvancedConfig

cfg = StereoDepthAdvancedConfig(
    focal_length=1733.74,  # fx from calib.txt
    cx=792.27,
    cy=541.89,
    baseline=536.62,  # B in mm
    doffs=0.0,        # d from calib.txt
)
```
From Intel RealSense SDK:
```python
import pyrealsense2 as rs

from vizion3d.stereo import StereoDepthAdvancedConfig

# Read the left infrared stream's intrinsics from the active profile.
pipeline = rs.pipeline()
profile = pipeline.start()
left_stream = profile.get_stream(rs.stream.infrared, 1)
intrinsics = left_stream.as_video_stream_profile().get_intrinsics()

cfg = StereoDepthAdvancedConfig(
    focal_length=intrinsics.fx,
    cx=intrinsics.ppx,
    cy=intrinsics.ppy,
    baseline=50.0,  # RealSense D435 baseline ≈ 50 mm
)
```
Approximation from field of view:
```python
import math

from vizion3d.stereo import StereoDepthAdvancedConfig

hfov_deg = 90.0  # horizontal FOV from the camera spec sheet
image_width = 1280
focal_length = image_width / (2 * math.tan(math.radians(hfov_deg / 2)))

cfg = StereoDepthAdvancedConfig(
    focal_length=focal_length,
    cx=image_width / 2 - 0.5,
    cy=720 / 2 - 0.5,
    baseline=100.0,
)
```
Known limitations
- Rectified pairs required: images must be stereo-rectified so that corresponding points lie on the same horizontal scanline. Un-rectified pairs will produce incorrect results.
- Metric scale depends on calibration: an incorrect `baseline` or `focal_length` scales all depth values uniformly. Always use calibrated values for real applications.
- Python 3.12 required for Open3D: `return_depth_image` and `return_point_cloud` require Open3D, which currently only supports Python 3.12 in this project.