ROMIP: Russian Information Retrieval Evaluation Seminar



Content-Based Retrieval Track


The purpose of this track is to evaluate content-based image retrieval (CBIR) over a generic color photo collection with heterogeneous content, such as can be found in personal photo archives.

The objective of the Content-Based Retrieval Track is to identify the images in the entire collection that globally or locally match the query image by visual and semantic concepts. Two images match globally when they depict a similar scene (for example, two night urban shots); they match locally when they contain similar objects, possibly against different backgrounds.

Example image sets illustrate the three categories of judgment: images treated as similar to each other, as probably similar, and as non-similar.

This track follows the standard ROMIP procedure.

Test Collection

The Flickr image test collection is used for this track.

Task Description for Participating Systems

We provide a list of images randomly selected from the same data collection; these serve as queries for content-based query-by-example search. The expected result is a ranked list of image names, with at most 100 entries per query.
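The per-query submission constraint can be sketched as a small normalization step. This is a minimal illustration, not an official tool: the representation of a run as a Python list of image-name strings is an assumption.

```python
# Minimal sketch of preparing one query's result list for submission.
# The 100-item cap comes from the track description; the plain-list
# representation of a run is an assumption for illustration.

MAX_RESULTS = 100  # maximum result-list size per query

def prepare_run(ranked_names):
    """Deduplicate a ranked list of image names (keeping the first
    occurrence of each) and truncate it to the per-query limit."""
    seen = set()
    out = []
    for name in ranked_names:
        if name not in seen:
            seen.add(name)
            out.append(name)
        if len(out) == MAX_RESULTS:
            break
    return out
```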

Participants are allowed to submit more than one result list per query.

Evaluation Methodology

Evaluation will be performed by relevance assessors. Given the high subjectivity of image-similarity judgments for this kind of task, several independent assessors will take part in the evaluation. A pooling approach will be used to evaluate the results (pool depth is 50). Image pools will be created for a randomly selected subset of queries and judged for relevance on a three-level scale: 1) relevant, 2) probably relevant, 3) not relevant. Relevance assessments are based entirely on the visual content of the images.
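Depth-50 pooling means the set of images judged for a query is the union of the top-50 results from every submitted run. A minimal sketch, assuming each run is a dict mapping a query id to its ranked list of image names:

```python
# Sketch of pool construction for one query: take the union of the
# top-`depth` results across all submitted runs. The run representation
# (dict: query id -> ranked list of image names) is an assumption.

POOL_DEPTH = 50  # pool depth stated in the track description

def build_pool(runs, query_id, depth=POOL_DEPTH):
    """Collect the set of images to be judged for one query."""
    pool = set()
    for run in runs:
        pool.update(run.get(query_id, [])[:depth])
    return pool
```

Only pooled images receive judgments; anything a system retrieves outside the pool remains unjudged.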

  • instructions for assessors:
    assessors judge the similarity of each image to the query image based on its visual content.
  • evaluation method: pooling (pool depth is 50)
  • relevance scale:
    • yes / probably yes / no
  • official metrics:
    • precision
    • recall
    • TREC 11-point precision/recall graph
    • bpref
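The official metrics above can be sketched over a judged pool. This is an illustrative implementation, not the track's scoring code: the sets `relevant` and `nonrelevant` stand for judged image names, unjudged images are ignored by bpref, and how the track maps "probably relevant" onto the binary metrics is an assumption left to the caller.

```python
# Hedged sketch of the official metrics for one query. `ranked` is a
# system's result list; `relevant` / `nonrelevant` are sets of judged
# image names (unjudged images belong to neither set).

def precision_recall(ranked, relevant):
    """Set-based precision and recall over the returned list."""
    retrieved_rel = sum(1 for name in ranked if name in relevant)
    precision = retrieved_rel / len(ranked) if ranked else 0.0
    recall = retrieved_rel / len(relevant) if relevant else 0.0
    return precision, recall

def eleven_point(ranked, relevant):
    """Interpolated precision at recall levels 0.0, 0.1, ..., 1.0
    (the TREC 11-point precision/recall graph)."""
    R = len(relevant)
    pr, hits = [], 0
    for i, name in enumerate(ranked, start=1):
        if name in relevant:
            hits += 1
            pr.append((hits / R, hits / i))  # (recall, precision) at rank i
    points = []
    for level in [i / 10 for i in range(11)]:
        ps = [p for r, p in pr if r >= level]
        points.append(max(ps) if ps else 0.0)
    return points

def bpref(ranked, relevant, nonrelevant):
    """bpref: penalize relevant images ranked below judged non-relevant
    ones; unjudged images do not affect the score."""
    R, N = len(relevant), len(nonrelevant)
    if R == 0:
        return 0.0
    denom = min(R, N)
    score, n_above = 0.0, 0
    for name in ranked:
        if name in nonrelevant:
            n_above += 1
        elif name in relevant:
            score += 1.0 - (min(n_above, denom) / denom if denom else 0.0)
    return score / R
```

For example, for the list ["a", "x", "b"] with "a" and "b" relevant and "x" judged non-relevant, precision is 2/3, recall is 1.0, and bpref is 0.5, since "b" is ranked below the one judged non-relevant image.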

Data Formats