Download
Train (original)
Train dataset in sharded TFRecords format containing 415,775 original images
Validation (original)
Validation dataset in sharded TFRecords format containing 47,029 original images
Train (JPEG)
Train dataset in sharded TFRecords format containing 415,775 JPEG images in RGB colorspace
Validation (JPEG)
Validation dataset in sharded TFRecords format containing 47,029 JPEG images in RGB colorspace
Normalized Labels
List of normalized labels and their IDs in CSV format
Raw Labels
List of raw original labels and their IDs in CSV format
Sources
List of source organizations and their IDs in CSV format
TAR Archives
Tar archive of image tfrecords files
Working with TFRecords files
The train and test (validation) datasets are provided in TFRecords format and can be used directly with the TensorFlow API. If you're not using TensorFlow, we recommend using terf to inspect or extract the raw image data.
For example, assuming you've downloaded the MARCO training data to a directory called train, extract the raw image data by downloading terf for your platform and running the following command:
./terf -d extract --input train/ -o marco-data/
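For TensorFlow users, the following is a minimal sketch of reading the records with the tf.data API. It assumes TensorFlow 2.x, the JPEG variant of the data set, and that every record carries all of the fields documented in the Details section as scalar values. The names FIELD_KINDS, build_feature_spec and load_dataset, and the glob pattern passed to load_dataset, are hypothetical and chosen only for illustration.

```python
import glob

# Field names and value kinds as documented in the Details section.
# "int" maps to tf.int64 and "bytes" to tf.string when parsing.
FIELD_KINDS = {
    "image/height": "int",
    "image/width": "int",
    "image/colorspace": "bytes",
    "image/channels": "int",
    "image/class/label": "int",
    "image/class/raw": "int",
    "image/class/source": "int",
    "image/class/text": "bytes",
    "image/format": "bytes",
    "image/filename": "bytes",
    "image/id": "int",
    "image/encoded": "bytes",
}


def build_feature_spec():
    """Translate FIELD_KINDS into a spec for tf.io.parse_single_example."""
    import tensorflow as tf  # imported lazily; only needed when parsing

    dtypes = {"int": tf.int64, "bytes": tf.string}
    # Assumption: each field holds exactly one value per record.
    return {
        name: tf.io.FixedLenFeature([], dtypes[kind])
        for name, kind in FIELD_KINDS.items()
    }


def load_dataset(pattern):
    """Return a tf.data.Dataset of (image, label) pairs.

    `pattern` is a glob matching the downloaded TFRecords files.
    """
    import tensorflow as tf

    spec = build_feature_spec()

    def parse(serialized):
        example = tf.io.parse_single_example(serialized, spec)
        # Assumption: the JPEG variant is used, so the bytes decode as JPEG.
        image = tf.io.decode_jpeg(example["image/encoded"], channels=3)
        return image, example["image/class/label"]

    files = sorted(glob.glob(pattern))
    return tf.data.TFRecordDataset(files).map(parse)
```

With the training files in a directory called train, iterating over load_dataset("train/*").take(1) should yield one decoded image tensor together with its normalized label ID. Adjust the pattern if the directory contains anything other than TFRecords files.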
Details
The MARCO data set contains 462,804 scored images from five source institutions:
- 0 - Collaborative Crystallisation Centre
- 1 - GlaxoSmithKline
- 2 - Hauptman-Woodward Medical Research Institute
- 3 - Merck & Co.
- 4 - Bristol-Myers Squibb
Images were scored by one or more crystallographers using the following 4 labels:
- 0 - Clear
- 1 - Crystals
- 2 - Other
- 3 - Precipitate
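When working with parsed records it can help to map the numeric IDs back to names. The sketch below copies the two lists above into lookup tables. It assumes the numbers shown are the IDs stored in the records; the names LABELS, SOURCES and describe are hypothetical, and the downloadable CSV files remain the authoritative mapping.

```python
# Normalized label IDs, copied from the list above.
LABELS = {0: "Clear", 1: "Crystals", 2: "Other", 3: "Precipitate"}

# Source institution IDs, copied from the list above.
SOURCES = {
    0: "Collaborative Crystallisation Centre",
    1: "GlaxoSmithKline",
    2: "Hauptman-Woodward Medical Research Institute",
    3: "Merck & Co.",
    4: "Bristol-Myers Squibb",
}


def describe(label_id, source_id):
    """Return a human-readable description for a (label, source) ID pair."""
    return f"{LABELS[label_id]} (source: {SOURCES[source_id]})"


print(describe(1, 2))
# prints "Crystals (source: Hauptman-Woodward Medical Research Institute)"
```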
The MARCO data set consists of train and test data sets, each sharded into multiple files in TFRecords format. TFRecords is the recommended file format for TensorFlow and stores tf.train.Example protocol buffers (which contain Features as a field). Each TFRecords file contains ~1024 records, and each record is a serialized Example proto. The Example proto contains the following fields:
- image/height: integer, image height in pixels
- image/width: integer, image width in pixels
- image/colorspace: string, specifying the colorspace, always 'RGB'
- image/channels: integer, specifying the number of channels, always 3
- image/class/label: integer, specifying the index in a normalized classification layer
- image/class/raw: integer, specifying the index in the raw (original) classification layer
- image/class/source: integer, specifying the index of the source (creator of the image)
- image/class/text: string, specifying the human-readable version of the normalized label
- image/format: string, specifying the format, always 'JPEG'
- image/filename: string containing the basename of the image file
- image/id: integer, specifying the unique id for the image
- image/encoded: string, containing JPEG encoded image in RGB colorspace
Here's the summary for each dataset:
-------------
Train
-------------
Total: 415775
Label:
  - Other: 23299
  - Clear: 139865
  - Precipitate: 199567
  - Crystals: 53044
Source:
  - C3: 13111
  - GSK: 70248
  - HWI: 67225
  - Merck: 257889
  - BMS: 7302
Format:
  - PNG: 67225
  - JPEG: 348550
-------------
Test
-------------
Total: 47029
Label:
  - Clear: 15659
  - Precipitate: 22360
  - Other: 2810
  - Crystals: 6200
Source:
  - C3: 1497
  - GSK: 7870
  - HWI: 7698
  - Merck: 29153
  - BMS: 811
Format:
  - JPEG: 39331
  - PNG: 7698
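The label counts in the summary are uneven, which matters when training or evaluating a classifier. As a small sketch, the counts can be checked against the stated totals and turned into per-class fractions. The names TRAIN, TEST and fractions are hypothetical; the numbers are copied from the summary.

```python
# Per-label image counts, copied from the summary.
TRAIN = {"Other": 23299, "Clear": 139865, "Precipitate": 199567, "Crystals": 53044}
TEST = {"Clear": 15659, "Precipitate": 22360, "Other": 2810, "Crystals": 6200}


def fractions(counts):
    """Return each label's share of the total number of images."""
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}


# The label counts add up to the stated totals for each split.
assert sum(TRAIN.values()) == 415775
assert sum(TEST.values()) == 47029
# Together the two splits account for all 462,804 scored images.
assert sum(TRAIN.values()) + sum(TEST.values()) == 462804

for label, share in sorted(fractions(TRAIN).items()):
    print(f"{label:12s} {share:.1%}")
```

In the training split, Precipitate accounts for close to half of the images and Other for under 6%, so per-class metrics or class weighting may be worth considering.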