Task Details


Our datasets, both the released sets and the evaluation set, are derived from the YFCC100M dataset. Each dataset consists of vectors encoded from images with the CLIP model and then reduced to 100 dimensions using Principal Component Analysis (PCA). In addition, a categorical attribute and a timestamp attribute are selected from the image metadata. The categorical attribute is discretized into integers starting from 0, and the timestamp attribute is normalized into floats between 0 and 1.
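The attribute preprocessing described above can be illustrated with a minimal sketch. It is not the organizers' pipeline: the function names are hypothetical, assigning codes in order of first appearance is an assumption, and min-max scaling is only one plausible way to normalize timestamps into the range 0 to 1.

```python
def discretize(labels):
    """Map category labels to integer codes 0, 1, 2, ...

    Assumption: codes are assigned in order of first appearance.
    """
    codes = {}
    return [codes.setdefault(label, len(codes)) for label in labels]


def normalize(timestamps):
    """Scale timestamps into floats between 0 and 1.

    Assumption: min-max scaling over the whole dataset.
    """
    lo, hi = min(timestamps), max(timestamps)
    if hi == lo:
        # All timestamps are identical, so there is no range to scale over.
        return [0.0 for _ in timestamps]
    return [(t - lo) / (hi - lo) for t in timestamps]


# Illustrative values only, not taken from the contest data.
categories = discretize(["cat", "dog", "cat", "bird"])
times = normalize([1200000000, 1300000000, 1400000000])
```

Here `categories` holds small integer codes starting at 0, and `times` holds floats whose minimum is 0 and maximum is 1.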

For each query, a query type is randomly selected from four possible types, denoted by the numbers 0 to 3. We then randomly choose two data points from dataset D and use their categorical attribute (C), timestamp attribute (T), and vectors to determine the values of the query. Specifically:

  • Randomly sample two data points from D.
  • Use the categorical value of the first data point as v for the equality predicate over the categorical attribute C.
  • Use the timestamp attribute values of the two sampled data points for the range predicate. Designate l as the smaller timestamp value and r as the larger. The range predicate is thus defined as l≤T≤r.
  • Use the vector of the first data point as the query vector.
  • If the query type does not involve v, l, or r, their values are set to -1.
We ensure that at least 100 data points in D satisfy the predicates of each query.
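The sampling steps above can be sketched as follows. This is an illustration under stated assumptions, not the organizers' generator: the mapping from query type to the predicates it uses (`TYPE_USES`) is assumed, since the text only says the types are numbered 0 to 3, and the retry loop is one possible way to guarantee the minimum of 100 matching points. The names `make_query` and `count_matches` are hypothetical.

```python
import random

# Assumed mapping: query type -> (uses equality predicate on C, uses range predicate on T).
TYPE_USES = {0: (False, False), 1: (True, False), 2: (False, True), 3: (True, True)}


def count_matches(data, qtype, v, l, r):
    """Count the points in data that satisfy the predicates of the query."""
    use_c, use_t = TYPE_USES[qtype]
    return sum(
        1
        for c, t, _ in data
        if (not use_c or c == v) and (not use_t or l <= t <= r)
    )


def make_query(data, rng, min_matches=100, max_tries=1000):
    """Build one query from data, a list of (category, timestamp, vector) tuples."""
    for _ in range(max_tries):
        qtype = rng.randrange(4)
        first, second = rng.sample(data, 2)
        use_c, use_t = TYPE_USES[qtype]
        # Unused values are set to -1, as described in the list above.
        v = first[0] if use_c else -1
        l, r = sorted((first[1], second[1])) if use_t else (-1, -1)
        # Assumption: resample until enough data points match.
        if count_matches(data, qtype, v, l, r) >= min_matches:
            return qtype, v, l, r, first[2]
    raise RuntimeError("no query satisfied the minimum match count")


# Toy dataset with 3 categories and random timestamps in [0, 1).
rng = random.Random(0)
data = [(i % 3, rng.random(), [float(i)]) for i in range(500)]
query = make_query(data, rng)
```

The returned tuple holds the query type, v, l, r, and the query vector taken from the first sampled point.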


#  Name                             Description                                                Dataset Size  Query Set Size
1  dummy-data.bin                   dummy data and queries for packing submission in reprozip  10^4          10^2
   dummy-queries.bin
2  contest-data-release-1m.bin      medium-scale released data and queries                     10^6          10^4
   contest-queries-release-1m.bin
3  contest-data-release-10m.bin     large-scale released data and queries                      10^7          4 * 10^6
   contest-queries-release-10m.bin
5  secret-data-10m.bin              secret large-scale data and queries, used for evaluation   10^7          4 * 10^6
   secret-queries-10m.bin

You can use AzCopy for downloading large-scale datasets.