Surgical object detection and its impact on patient outcomes
What is surgical object detection?
Surgical object detection (SOD) refers to identifying and localizing objects, such as instruments and anatomical structures, in endoscopic surgery videos. From a machine learning perspective, each detected region can be formulated as a multi-class classification problem, where the model assigns a probability to each possible class, paired with a localization step that predicts where the object sits in the frame. This has important applications in both real-time analysis during surgery and post-operative review of recorded videos.
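To make this concrete, below is a minimal sketch of frame-level detection using torchvision's Faster R-CNN. The COCO-pretrained weights and the frame path are placeholders; a real SOD model would have its detection head replaced and be fine-tuned on an annotated surgical dataset whose classes cover instruments and anatomy.

```python
import torch
import torchvision
from torchvision.transforms.functional import to_tensor
from PIL import Image

# COCO-pretrained detector used purely as a stand-in; a surgical model would be
# fine-tuned so the predicted labels correspond to instruments and anatomy.
model = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights="DEFAULT")
model.eval()

# One frame extracted from an endoscopic video (placeholder path).
frame = to_tensor(Image.open("frame_000123.png").convert("RGB"))

with torch.no_grad():
    output = model([frame])[0]  # dict with "boxes", "labels", "scores"

for box, label, score in zip(output["boxes"], output["labels"], output["scores"]):
    if score > 0.5:  # keep only confident detections
        print(f"class={label.item()}  score={score:.2f}  box={box.tolist()}")
```

Each detection is a class probability attached to a bounding box, which is exactly the per-region classification view described above.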
Real-time applications
For real-time detection, the goal is to give the surgeon a navigational aid that reduces serious risk and potential patient harm. The system provides live information about instrument location and trajectory, and may even estimate distances to critical structures.
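A simple building block for such a navigational aid is a proximity check between detections. The sketch below measures the pixel distance between a detected instrument and a critical structure; the box coordinates and the warning threshold are made up, and converting pixels into physical distances would additionally require depth estimation or camera calibration, which this sketch does not attempt.

```python
import numpy as np

def box_center(box):
    """Center (x, y) of an [x1, y1, x2, y2] bounding box in pixels."""
    x1, y1, x2, y2 = box
    return np.array([(x1 + x2) / 2.0, (y1 + y2) / 2.0])

def pixel_distance(tool_box, structure_box):
    """Euclidean pixel distance between two detection centers."""
    return float(np.linalg.norm(box_center(tool_box) - box_center(structure_box)))

# Hypothetical detections: an instrument tip and a critical structure region.
tool_box = [412, 230, 470, 295]
structure_box = [520, 310, 600, 380]

if pixel_distance(tool_box, structure_box) < 150:  # threshold is illustrative
    print("proximity warning: instrument close to critical structure")
```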
This technology can also help prevent surgical instruments from being left behind before the healthcare staff close the patient. The system achieves this by counting each unique item that enters and leaves the surgical workspace.
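A toy version of that counting logic is sketched below. It assumes a hypothetical upstream tracker that reports when an item enters or is withdrawn through the border of the surgical field; real systems must also cope with occlusion and re-identification, which this sketch deliberately ignores.

```python
from collections import Counter

class InstrumentTally:
    """Count instruments entering and leaving the surgical field.

    Assumes a hypothetical upstream tracker that emits an ("enter", type) event
    when a tracked item appears at the field border and an ("exit", type) event
    when it is withdrawn; mid-field occlusions are assumed to be handled by the
    tracker and do not generate events.
    """

    def __init__(self):
        self.entered = Counter()
        self.exited = Counter()

    def record(self, event, instrument_type):
        if event == "enter":
            self.entered[instrument_type] += 1
        elif event == "exit":
            self.exited[instrument_type] += 1

    def unaccounted(self):
        """Instrument types with more recorded entries than exits."""
        return {t: n - self.exited[t]
                for t, n in self.entered.items() if n > self.exited[t]}

tally = InstrumentTally()
tally.record("enter", "grasper")
tally.record("enter", "gauze")
tally.record("exit", "grasper")
print(tally.unaccounted())  # {'gauze': 1} -> flag before closing
```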
Offline applications
One of the main offline applications of SOD is skill assessment. Here the goal is to understand how a particular surgeon performs a step of a procedure compared to peers in the same specialty. This makes it possible to analyze their workflow and suggest targeted adjustments that reduce procedure time and improve outcomes.
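As a toy example of the kind of signal such an analysis might start from, the sketch below turns frame-level detections into per-instrument active time. Using usage time as a comparison signal is an illustrative assumption here, not a validated skill metric.

```python
def instrument_usage_seconds(frame_detections, fps=30):
    """Aggregate per-frame detections into active seconds per instrument.

    frame_detections: list where entry i is the set of instrument names
    detected in frame i of the recorded procedure.
    """
    seconds = {}
    for frame in frame_detections:
        for name in frame:
            seconds[name] = seconds.get(name, 0.0) + 1.0 / fps
    return seconds

# Tiny illustrative timeline at 1 frame per second.
detections = [{"grasper"}, {"grasper", "hook"}, {"hook"}, set()]
print(instrument_usage_seconds(detections, fps=1))  # {'grasper': 2.0, 'hook': 2.0}
```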
Convolutional Neural Networks vs. Vision Transformers for Object Detection
In the rapidly evolving landscape of computer vision, both CNNs and Vision Transformers (ViTs) have emerged as powerful tools for a variety of tasks, including multi-class classification problems. While CNNs are renowned for their versatility in object detection and segmentation, Vision Transformers have revolutionized image classification by applying the self-attention mechanism traditionally used in natural language processing.
Performance and Benchmarks
Numerous studies and benchmarks have compared the performance of CNN-based models and Vision Transformers on various multi-class classification tasks, with the ImageNet dataset being a primary benchmark.
Accuracy: On large-scale datasets like ImageNet, well-trained Vision Transformers have demonstrated the ability to outperform state-of-the-art CNNs. The global attention mechanism of ViTs allows them to capture broader contextual information, which can be advantageous for datasets with diverse and complex scenes. However, on smaller datasets, CNNs often exhibit better performance due to their inherent inductive biases, which provide a stronger “prior” for image data. ViTs, lacking these biases, generally require more data to learn visual patterns effectively.
Data Efficiency: CNNs are generally more data-efficient than ViTs. The convolutional and pooling layers in CNNs enforce a spatial hierarchy and translation equivariance, which are beneficial when training data is limited. In contrast, ViTs have a more flexible and expressive architecture but require substantial amounts of data to learn these fundamental properties of images from scratch. Transfer learning from large pre-trained datasets can significantly mitigate this issue for ViTs.
Computational Cost: The computational complexity of ViTs scales quadratically with the number of patches (and thus with image resolution). This can make them computationally expensive for high-resolution images. CNNs, with their localized receptive fields, can be more efficient in this regard. However, advancements in ViT architectures, such as the Swin Transformer, which uses shifted windows for attention, have addressed some of these efficiency concerns; a short back-of-the-envelope comparison of the scaling follows below.
Robustness and Generalization: Some studies suggest that ViTs may exhibit better robustness to occlusions and domain shifts compared to CNNs. Their ability to attend to a global context can make them less reliant on specific local features that might be absent or altered in out-of-distribution data.
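To put rough numbers on the computational-cost point above, the sketch below counts the token pairs one self-attention layer must score for global ViT attention versus Swin-style windowed attention. Patch and window sizes follow common defaults (16-pixel patches, 7x7 windows); feature dimensions and head counts are ignored, so this is only a back-of-the-envelope comparison.

```python
def attention_token_pairs(image_size, patch_size=16, window_size=None):
    """Token pairs scored by one self-attention layer (constants ignored)."""
    n = (image_size // patch_size) ** 2          # number of patches (tokens)
    if window_size is None:
        return n * n                             # global attention: quadratic in N
    tokens_per_window = window_size ** 2
    num_windows = n // tokens_per_window
    return num_windows * tokens_per_window ** 2  # windowed attention: linear in N

for size in (224, 448, 896):
    print(size, attention_token_pairs(size), attention_token_pairs(size, window_size=7))
# -> 224 38416 9604
# -> 448 614656 38416
# -> 896 9834496 153664
```

Doubling the image side quadruples the number of patches, so the global-attention term grows roughly sixteen-fold per doubling while the windowed term grows roughly four-fold, which is the quadratic-versus-linear gap the Swin Transformer was designed to close.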
At Encinalabs, we believe medical computer vision will redefine how surgical and diagnostic tools are built — making them powerful, collaborative, and fundamentally private.
If you are working on medical computer vision AI, we would love to connect. Let’s build the future together.