Task Details
Our datasets, both the released sets and the evaluation set, are derived from the YFCC100M Dataset. Each dataset comprises vectors encoded from images using the CLIP model, which are then reduced to 100 dimensions using Principal Component Analysis (PCA). Additionally, a categorical attribute and a timestamp attribute are selected from the metadata of the images. The categorical attribute is discretized into integers starting from 0, and the timestamp attribute is normalized into floats between 0 and 1.
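The attribute preprocessing described above can be sketched as follows. This is only an illustration, not the organizers' pipeline: the function names are hypothetical, the first-seen ordering of category codes is an assumption, and min-max scaling is an assumed way of mapping timestamps into [0, 1] (the text does not say which normalization was used). The CLIP encoding and PCA steps are not covered here.

```python
def discretize(labels):
    """Map category labels to integer codes 0, 1, 2, ... in first-seen order.

    The ordering is an assumption; the text only says codes start from 0.
    """
    codes = {}
    return [codes.setdefault(label, len(codes)) for label in labels]


def normalize_timestamps(timestamps):
    """Scale timestamps into [0, 1] with min-max normalization (assumed scheme)."""
    lo, hi = min(timestamps), max(timestamps)
    if hi == lo:
        # All timestamps are identical; map them all to 0.0.
        return [0.0 for _ in timestamps]
    return [(t - lo) / (hi - lo) for t in timestamps]
```

For example, `discretize(["cat", "dog", "cat"])` yields `[0, 1, 0]`, and `normalize_timestamps([10, 20, 30])` yields `[0.0, 0.5, 1.0]`.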
For each query, a query type is randomly selected from four possible types, denoted by the numbers 0 to 3. Then, we randomly choose two data points from dataset D and use their categorical attribute (C), timestamp attribute (T), and vectors to determine the values of the query. Specifically:
- Randomly sample two data points from D.
- Use the categorical value of the first data point as v for the equality predicate over the categorical attribute C.
- Use the timestamp attribute values of the two sampled data points for the range predicate. Designate l as the smaller timestamp value and r as the larger. The range predicate is thus defined as l≤T≤r.
- Use the vector of the first data point as the query vector.
- If the query type does not involve v, l, or r, their values are set to -1.
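The steps above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the generator used by the organizers: `make_query` and the tuple layout of a data point are hypothetical, the two points are assumed to be drawn without replacement, and the mapping of type numbers to predicates (0 = vector only, 1 = equality on C, 2 = range on T, 3 = both) is an assumption, since this section does not define the four types.

```python
import random


def make_query(dataset, rng):
    """Build one query from a dataset of (category, timestamp, vector) tuples."""
    query_type = rng.randrange(4)  # one of 0, 1, 2, 3

    # Randomly sample two data points from D (assumed: without replacement).
    first, second = rng.sample(dataset, 2)

    v = first[0]                              # equality predicate: C == v
    l, r = sorted((first[1], second[1]))      # range predicate: l <= T <= r
    vector = first[2]                         # query vector

    # Assumed type mapping: types 0 and 2 do not use v,
    # and types 0 and 1 do not use l and r.
    if query_type in (0, 2):
        v = -1
    if query_type in (0, 1):
        l, r = -1, -1
    return query_type, v, l, r, vector
```

Calling `make_query(points, random.Random(0))` on a list of such tuples returns a tuple `(query_type, v, l, r, vector)` in which unused predicate values are -1.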
# | Name | Description | Dataset Size | Query Set Size |
---|---|---|---|---|
1 | dummy-data.bin, dummy-queries.bin | dummy data and queries for packing submission in reprozip | 10^4 | 10^2 |
2 | | medium-scale released data and queries | 10^6 | 10^4 |
3 | | large-scale released data and queries | 10^7 | 4 * 10^6 |
5 | secret-data-10m.bin, secret-queries-10m.bin | secret large-scale data and queries, used for evaluation | 10^7 | 4 * 10^6 |
You can use AzCopy for downloading large-scale datasets.