ACM SIGMOD 2024 Programming Contest

Task Details

Our datasets, both the released sets and the evaluation set, are derived from the YFCC100M dataset. Each dataset consists of vectors encoded from images with the CLIP model and then reduced to 100 dimensions using Principal Component Analysis (PCA). In addition, a categorical attribute and a timestamp attribute are selected from the image metadata. The categorical attribute is discretized into integers starting from 0, and the timestamp attribute is normalized into floats between 0 and 1.
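As a rough illustration of the two attribute transformations described above, the sketch below maps raw categorical labels to integer codes starting from 0 and rescales timestamps into the range 0 to 1. The exact procedure used by the organizers is not specified here: assigning codes in order of first appearance and using min-max scaling are both assumptions, and the function names are hypothetical.

```python
def discretize(labels):
    """Map each distinct label to an integer code, starting from 0.

    Assumption: codes are assigned in order of first appearance.
    """
    codes = {}
    return [codes.setdefault(label, len(codes)) for label in labels]


def normalize(timestamps):
    """Rescale timestamps into floats between 0 and 1.

    Assumption: min-max scaling over the whole dataset.
    """
    lo, hi = min(timestamps), max(timestamps)
    if hi == lo:
        # All timestamps are identical; map them all to 0.0.
        return [0.0 for _ in timestamps]
    return [(t - lo) / (hi - lo) for t in timestamps]
```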

For each query, a query type is randomly selected from four possible types, denoted by the numbers 0 to 3. Then we randomly choose two data points from dataset D and use their categorical attribute (C), timestamp attribute (T), and vectors to determine the values of the query. Specifically:

Randomly sample two data points from D.
Use the categorical value of the first data point as v for the equality predicate over the categorical attribute C.
Use the timestamp attribute values of the two sampled data points for the range predicate. Designate l as the smaller timestamp value and r as the larger. The range predicate is thus defined as l≤T≤r.
Use the vector of the first data point as the query vector.
If the query type does not involve v, l, or r, their values are set to -1.
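The steps above can be sketched in a few lines of Python. This is a minimal illustration, not the organizers' generator. It assumes each data point is a `(c, t, vector)` tuple, that the two sampled points are distinct, and that the query types map to predicates as 0 = vector only, 1 = equality on C, 2 = range on T, 3 = both. None of these details is stated in this section, and the name `make_query` is hypothetical.

```python
import random

# Assumed mapping of query types to predicates (not stated in this section):
# 0 = vector only, 1 = equality on C, 2 = range on T, 3 = both.
USES_CATEGORY = {1, 3}
USES_RANGE = {2, 3}


def make_query(dataset, rng):
    """Build one query from a dataset of (c, t, vector) tuples.

    rng is a random.Random instance, passed in so runs can be reproduced.
    """
    query_type = rng.randrange(4)            # one of 0, 1, 2, 3
    first, second = rng.sample(dataset, 2)   # two distinct data points (assumption)
    c1, t1, vec1 = first
    _, t2, _ = second

    # Equality predicate value: category of the first point, or -1 if unused.
    v = c1 if query_type in USES_CATEGORY else -1

    # Range predicate l <= T <= r from the two timestamps, or -1 if unused.
    if query_type in USES_RANGE:
        l, r = min(t1, t2), max(t1, t2)
    else:
        l, r = -1, -1

    # The query vector is the vector of the first sampled point.
    return query_type, v, l, r, list(vec1)
```

Passing a seeded generator such as `random.Random(0)` makes the generated query set repeatable across runs.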

We guarantee that at least 100 data points in D satisfy the predicates of each query.
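A check of this kind could look like the following sketch, which counts how many data points pass a query's predicates. It relies on the same assumptions as before, none of which is confirmed by this section: data points are `(c, t, vector)` tuples, and query types 1 and 3 apply the equality predicate on C while types 2 and 3 apply the range predicate on T. All function names are hypothetical.

```python
def matches(point, query_type, v, l, r):
    """Return True if a (c, t, vector) point passes the query's predicates.

    Assumed type mapping: 1 and 3 use C == v; 2 and 3 use l <= T <= r.
    """
    c, t, _ = point
    if query_type in (1, 3) and c != v:
        return False
    if query_type in (2, 3) and not (l <= t <= r):
        return False
    return True


def count_matches(dataset, query_type, v, l, r):
    """Count the data points that satisfy the query's predicates."""
    return sum(matches(p, query_type, v, l, r) for p in dataset)


def is_valid_query(dataset, query_type, v, l, r, min_matches=100):
    """Check that at least min_matches points satisfy the query."""
    return count_matches(dataset, query_type, v, l, r) >= min_matches
```

A generator could resample until `is_valid_query` returns True, although whether the organizers use rejection sampling is not stated.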

#	Name	Description	Dataset Size	Query Set Size
1	dummy-data.bin dummy-queries.bin	dummy data and queries for packaging a submission with ReproZip	10⁴	10²
2	contest-data-release-1m.bin contest-queries-release-1m.bin	medium-scale released data and queries	10⁶	10⁴
3	contest-data-release-10m.bin contest-queries-release-10m.bin	large-scale released data and queries	10⁷	4 * 10⁶
5	secret-data-10m.bin secret-queries-10m.bin	secret large-scale data and queries, used for evaluation	10⁷	4 * 10⁶

You can use AzCopy to download the large-scale datasets.