OWLv2 Open-World Localization

This example demonstrates using the Owlv2 tool for object detection in images based on text prompts.

from PIL import Image

from vision_agent_tools.models.owlv2 import Owlv2

# (replace this path with your own!)
test_image = "path/to/your/image.jpg"

# What are you looking for? Write your detective prompts here!
prompts = ["a photo of a cat", "a photo of a dog"]

# Load the image (assumed here to be opened with Pillow) and create your Owlv2 detective tool
image = Image.open(test_image)
owlv2 = Owlv2()

# Time to put Owlv2 to work! Prompts come first and images are passed as a list;
# [0] picks the result for the first (and only) image.
results = owlv2(prompts=prompts, images=[image])[0]

# Did Owlv2 sniff out any objects? Let's see the results!
if results:
    for detection in results:
        print(f"Found it! It looks like a {detection['label']} with a confidence of {detection['score']:.2f}.")
        print(f"Here's where it's hiding: {detection['bbox']}")
else:
    print("Hmm, Owlv2 couldn't find anything this time. Maybe try a different prompt?")

Owlv2

Bases: BaseMLModel

Tool for object detection using the pre-trained Owlv2 model from Transformers.

This tool takes images (or a video) and text prompts as input, performs object detection using the Owlv2 model, and returns, for each image or frame, the predicted labels, confidence scores, and bounding boxes of the detected objects whose confidence exceeds a threshold.

__call__(prompts, images=None, video=None, *, batch_size=1, nms_threshold=0.3, confidence=0.1)

Performs object detection on images using the Owlv2 model.


Parameters:

prompts (list[str]): The text prompts to be used for object detection. Required.

images (list[Image] | None): The images to be analyzed. Default: None.

video (VideoNumpy[uint8] | None): A numpy array containing the frames of the video to be analyzed. Default: None.

batch_size (int): The batch size used for processing multiple images or video frames. Default: 1.

nms_threshold (float): The IoU threshold used when applying a simple class-agnostic Non-Maximum Suppression (NMS). Default: 0.3.

confidence (float): Confidence threshold for model predictions. Default: 0.1.
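To picture what batch_size controls, the following is a minimal sketch of splitting a sequence of frames into consecutive batches. It is an illustration only, not the tool's actual implementation: the helper name iter_batches and the string stand-ins for frames are hypothetical.

```python
from typing import Iterator, Sequence, TypeVar

T = TypeVar("T")


def iter_batches(items: Sequence[T], batch_size: int = 1) -> Iterator[Sequence[T]]:
    """Yield consecutive slices of `items`, each at most `batch_size` long.

    Hypothetical helper: it only illustrates the idea behind `batch_size`.
    """
    if batch_size < 1:
        raise ValueError("batch_size must be a positive integer")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


# Stand-ins for five video frames; real frames would be images or arrays.
frames = ["frame0", "frame1", "frame2", "frame3", "frame4"]
batches = [list(batch) for batch in iter_batches(frames, batch_size=2)]
print(batches)
```

With five frames and batch_size=2, the last batch holds the single leftover frame. Larger batches generally trade memory for fewer model invocations.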



Returns: list[ODWithScoreResponse]: A list of ODWithScoreResponse objects containing the predicted labels, confidence scores, and bounding boxes for detected objects with confidence exceeding the threshold. The item will be None if no objects are detected above the confidence threshold for a specific image / frame.
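The interplay of confidence and nms_threshold can be sketched in plain Python as a score filter followed by greedy class-agnostic NMS. This is a sketch under stated assumptions, not the library's code: the function names iou and filter_detections are hypothetical, the dictionary keys simply mirror the example at the top of this page, and whether the real comparisons are strict or inclusive is not confirmed by this documentation.

```python
def iou(box_a, box_b):
    """Intersection over Union of two (x0, y0, x1, y1) boxes."""
    ix0 = max(box_a[0], box_b[0])
    iy0 = max(box_a[1], box_b[1])
    ix1 = min(box_a[2], box_b[2])
    iy1 = min(box_a[3], box_b[3])
    inter = max(0.0, ix1 - ix0) * max(0.0, iy1 - iy0)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def filter_detections(detections, confidence=0.1, nms_threshold=0.3):
    """Drop low-score detections, then apply greedy class-agnostic NMS.

    Hypothetical helper. "Class-agnostic" means labels are ignored:
    a box can be suppressed by a higher-scoring box of another label.
    """
    candidates = sorted(
        (d for d in detections if d["score"] >= confidence),
        key=lambda d: d["score"],
        reverse=True,
    )
    kept = []
    for det in candidates:
        # Keep a box only if it does not overlap too much with any kept box.
        if all(iou(det["bbox"], k["bbox"]) <= nms_threshold for k in kept):
            kept.append(det)
    return kept


detections = [
    {"label": "cat", "score": 0.90, "bbox": [10, 10, 110, 110]},
    {"label": "dog", "score": 0.60, "bbox": [20, 20, 120, 120]},    # overlaps the cat box
    {"label": "dog", "score": 0.50, "bbox": [200, 200, 300, 300]},  # isolated
    {"label": "cat", "score": 0.05, "bbox": [400, 400, 450, 450]},  # below confidence
]
kept = filter_detections(detections)
for det in kept:
    print(det["label"], det["score"], det["bbox"])
```

In this toy input the second box overlaps the first with an IoU of about 0.68, so it is suppressed even though its label differs, and the last box is dropped by the confidence filter. Lowering nms_threshold suppresses more overlapping boxes, while raising confidence discards more low-score predictions.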

__init__(model_config=OWLV2Config())

Loads the pre-trained Owlv2 processor and model from Transformers.

Owlv2ProcessorWithNMS

Bases: Owlv2Processor

post_process_object_detection_with_nms(outputs, *, threshold=0.1, nms_threshold=0.3, target_sizes=None)

Converts the raw output of OwlViTForObjectDetection into final bounding boxes in (top_left_x, top_left_y, bottom_right_x, bottom_right_y) format.


Parameters:

outputs (OwlViTObjectDetectionOutput): Raw outputs of the model. Required.

threshold (float): Score threshold to keep object detection predictions. Default: 0.1.

nms_threshold (float): IoU threshold used to filter overlapping objects from the raw detections. Default: 0.3.

target_sizes (TensorType | list[Tuple] | None): Tensor of shape (batch_size, 2) or list of tuples (Tuple[int, int]) containing the target size (height, width) of each image in the batch. If unset, predictions will not be resized. Default: None.


Returns: list[dict]: A list of dictionaries, each dictionary containing the scores, labels and boxes for an image in the batch as predicted by the model.
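The box format and the role of target_sizes can be illustrated with a small sketch. It assumes the raw boxes are normalized (center_x, center_y, width, height) values in [0, 1] and scales them directly by the target width and height. That input format and the helper names are assumptions made for illustration; the actual processor may handle details such as image padding differently.

```python
def center_to_corners(box):
    """Convert (center_x, center_y, width, height) to corner format."""
    cx, cy, w, h = box
    return (cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2)


def to_corner_boxes(boxes, target_size=None):
    """Return boxes as (top_left_x, top_left_y, bottom_right_x, bottom_right_y).

    Hypothetical helper. `target_size` is (height, width), matching the
    order documented for `target_sizes`. If it is None, the boxes are
    converted but not resized.
    """
    corners = [center_to_corners(b) for b in boxes]
    if target_size is None:
        return corners
    height, width = target_size
    return [
        (x0 * width, y0 * height, x1 * width, y1 * height)
        for x0, y0, x1, y1 in corners
    ]


# One normalized box centered in the image, for a 480 x 640 (height x width) image.
raw_boxes = [(0.5, 0.5, 0.2, 0.4)]
pixel_boxes = to_corner_boxes(raw_boxes, target_size=(480, 640))
print([tuple(round(v, 1) for v in box) for box in pixel_boxes])
```

Note the (height, width) ordering of the target size: x coordinates are scaled by the width and y coordinates by the height, so swapping the two silently distorts every box.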