LLMs still do not locate bounding boxes well

I sent an image to over a dozen LLMs that support vision, asking them:

> Detect objects in this 1280×720 px image and return their color and bounding boxes in pixels. Respond as a JSON object: {[label]: [color, x1, y1, x2, y2], …}
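For reference, here is roughly what such a request looks like in code. This is a minimal sketch using the OpenAI Python SDK; the file name `scene.png` is a placeholder, and the other providers were called through their own APIs in the same way:

```python
import base64
import json
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Encode the test image (path is a placeholder) as a data URL.
with open("scene.png", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode()

prompt = (
    "Detect objects in this 1280Γ—720 px image and return their color and "
    "bounding boxes in pixels. Respond as a JSON object: "
    "{[label]: [color, x1, y1, x2, y2], ...}"
)

response = client.chat.completions.create(
    model="gpt-4o",
    response_format={"type": "json_object"},  # ask for parseable JSON
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": prompt},
            {"type": "image_url",
             "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
        ],
    }],
)

boxes = json.loads(response.choices[0].message.content)
print(boxes)  # e.g. {"ball": ["red", 412, 220, 530, 338], ...}
```

Requesting `json_object` output at the API level also avoids the markdown-fenced JSON that some models otherwise wrap around their answer.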

None of the models did a good enough job. It looks like we have some way to go before LLMs are good at bounding boxes.

I’ve given each model a subjective rating on a 1–5 scale below (each 🟒 is one point), separately for the positions and the sizes of the boxes.

| Model | Positions | Sizes |
|-------|-----------|-------|
| gemini-1.5-flash-001 | πŸŸ’πŸŸ’πŸŸ’πŸ”΄πŸ”΄ | πŸŸ’πŸŸ’πŸŸ’πŸŸ’πŸ”΄ |
| gemini-1.5-flash-8b | πŸŸ’πŸŸ’πŸŸ’πŸ”΄πŸ”΄ | πŸŸ’πŸŸ’πŸŸ’πŸ”΄πŸ”΄ |
| gemini-1.5-flash-002 | πŸŸ’πŸŸ’πŸ”΄πŸ”΄πŸ”΄ | πŸŸ’πŸŸ’πŸŸ’πŸ”΄πŸ”΄ |
| gemini-1.5-pro-002 | πŸŸ’πŸŸ’πŸŸ’πŸ”΄πŸ”΄ | πŸŸ’πŸŸ’πŸŸ’πŸŸ’πŸ”΄ |
| gpt-4o-mini | πŸŸ’πŸ”΄πŸ”΄πŸ”΄πŸ”΄ | πŸŸ’πŸŸ’πŸ”΄πŸ”΄πŸ”΄ |
| gpt-4o | πŸŸ’πŸŸ’πŸŸ’πŸŸ’πŸ”΄ | πŸŸ’πŸŸ’πŸŸ’πŸŸ’πŸ”΄ |
| chatgpt-4o-latest | πŸŸ’πŸŸ’πŸŸ’πŸŸ’πŸ”΄ | πŸŸ’πŸŸ’πŸŸ’πŸŸ’πŸ”΄ |
| claude-3-haiku-20240307 | πŸŸ’πŸ”΄πŸ”΄πŸ”΄πŸ”΄ | πŸŸ’πŸŸ’πŸ”΄πŸ”΄πŸ”΄ |
| claude-3-5-sonnet-20241022 | πŸŸ’πŸŸ’πŸŸ’πŸ”΄πŸ”΄ | πŸŸ’πŸŸ’πŸŸ’πŸ”΄πŸ”΄ |
| llama-3.2-11b-vision-preview | πŸ”΄πŸ”΄πŸ”΄πŸ”΄πŸ”΄ | πŸ”΄πŸ”΄πŸ”΄πŸ”΄πŸ”΄ |
| llama-3.2-90b-vision-preview | πŸŸ’πŸŸ’πŸŸ’πŸ”΄πŸ”΄ | πŸŸ’πŸŸ’πŸŸ’πŸ”΄πŸ”΄ |
| qwen-2-vl-72b-instruct | πŸŸ’πŸŸ’πŸŸ’πŸ”΄πŸ”΄ | πŸŸ’πŸŸ’πŸ”΄πŸ”΄πŸ”΄ |
| pixtral-12b | πŸŸ’πŸŸ’πŸ”΄πŸ”΄πŸ”΄ | πŸŸ’πŸŸ’πŸŸ’πŸ”΄πŸ”΄ |

I used an app I built for this.

Here is the original image along with the individual results.
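If you want to eyeball results the same way, a sketch like this overlays a model's boxes on the image, assuming Pillow and the `{label: [color, x1, y1, x2, y2]}` response format from the prompt above (the sample output is made up for illustration):

```python
from PIL import Image, ImageDraw

def draw_boxes(image_path: str, boxes: dict, out_path: str) -> None:
    """Overlay {label: [color, x1, y1, x2, y2]} boxes on an image."""
    img = Image.open(image_path).convert("RGB")
    draw = ImageDraw.Draw(img)
    for label, (color, x1, y1, x2, y2) in boxes.items():
        draw.rectangle([x1, y1, x2, y2], outline=color, width=3)
        draw.text((x1 + 4, y1 + 4), label, fill=color)
    img.save(out_path)

# Hypothetical model output, for illustration only:
draw_boxes("scene.png", {"ball": ["red", 412, 220, 530, 338]}, "scene-boxes.png")
```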

Update

Adding gridlines with labeled axes helps the LLMs. (Thanks @Bijan Mishra.) Here are a few examples:
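In case you want to reproduce the gridline overlay, here's a sketch using Pillow; the 100 px grid spacing and gray styling are my assumptions, not necessarily what was used:

```python
from PIL import Image, ImageDraw

def add_gridlines(image_path: str, out_path: str, step: int = 100) -> None:
    """Draw gridlines every `step` px and label the axes in pixel coordinates."""
    img = Image.open(image_path).convert("RGB")
    draw = ImageDraw.Draw(img)
    w, h = img.size
    for x in range(0, w, step):
        draw.line([(x, 0), (x, h)], fill="gray", width=1)
        draw.text((x + 2, 2), str(x), fill="gray")
    for y in range(0, h, step):
        draw.line([(0, y), (w, y)], fill="gray", width=1)
        draw.text((2, y + 2), str(y), fill="gray")
    img.save(out_path)

add_gridlines("scene.png", "scene-grid.png")
```

Presumably the labeled axes give the model explicit pixel coordinates to anchor its estimates against, instead of making it infer the scale from the image alone.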
