MARCO: MAchine Recognition of Crystallization Outcomes

The MARCO Image data sets are made available by the University at Buffalo under a Creative Commons Attribution 4.0 International License (CC BY 4.0). By downloading any data you agree to the terms of this license.

Train (original)

Train dataset in sharded TFRecords format containing 415,775 original images

Validation (original)

Validation dataset in sharded TFRecords format containing 47,029 original images

Train (JPEG)

Train dataset in sharded TFRecords format containing 415,775 JPEG images in RGB colorspace

Validation (JPEG)

Validation dataset in sharded TFRecords format containing 47,029 JPEG images in RGB colorspace

Normalized Labels

List of normalized labels and their IDs in CSV format

Raw Labels

List of raw original labels and their IDs in CSV format

Sources

List of source organizations and their IDs in CSV format

TAR Archives

Tar archive of the image TFRecords files

Working with TFRecords files

The train and test (validation) datasets are provided in TFRecords format and can be used directly with the TensorFlow API. If you're not using TensorFlow, we recommend using terf to inspect or extract the raw image data.

For example, assuming you've downloaded the MARCO training data to a directory called train, you can extract the raw image data by downloading terf for your platform and running the following command:

./terf -d extract --input train/ -o marco-data/
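
If neither TensorFlow nor terf is available, the records can still be pulled out of a TFRecords file with the Python standard library alone. The sketch below assumes the standard TFRecord framing (an 8-byte little-endian length, a 4-byte CRC of the length, the payload, and a 4-byte CRC of the payload). The function name read_tfrecords is illustrative and not part of any MARCO tooling, and the CRC fields are skipped rather than verified. Each yielded payload is a serialized Example proto that still needs a protobuf decoder to unpack.

```python
import struct


def read_tfrecords(path):
    """Yield the raw bytes of each record in a single TFRecords file.

    Assumes the standard TFRecord framing; CRC fields are read and
    discarded, not checked.
    """
    with open(path, "rb") as f:
        while True:
            header = f.read(12)  # 8-byte length + 4-byte CRC of the length
            if not header:
                return  # clean end of file
            if len(header) < 12:
                raise ValueError("truncated record header")
            (length,) = struct.unpack("<Q", header[:8])
            data = f.read(length)  # the serialized Example proto
            footer = f.read(4)  # 4-byte CRC of the payload
            if len(data) < length or len(footer) < 4:
                raise ValueError("truncated record body")
            yield data
```

For instance, sum(1 for _ in read_tfrecords(path)) counts the records in one file, where path points at whichever TFRecords file you downloaded.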

Details

The MARCO data set contains 462,804 scored images from five source institutions:

0 - Collaborative Crystallisation Centre
1 - GlaxoSmithKline
2 - Hauptman-Woodward Medical Research Institute
3 - Merck & Co.
4 - Bristol-Myers Squibb

The images were collected on imagers from two manufacturers (Rigaku Automation and Formulatrix), which have different optical systems, as well as on in-house imaging equipment built at the Hauptman-Woodward Medical Research Institute.

Images were scored by one or more crystallographers using the following four labels:

0 - Clear
1 - Crystals
2 - Other
3 - Precipitate
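
When decoding the integer fields in a record, the two ID lists above can be kept as plain lookup tables. This is a minimal sketch: the names LABELS, SOURCES and describe are illustrative, and the Normalized Labels and Sources CSV files remain the authoritative mapping.

```python
# Label and source IDs transcribed from the lists in this document.
LABELS = {0: "Clear", 1: "Crystals", 2: "Other", 3: "Precipitate"}
SOURCES = {
    0: "Collaborative Crystallisation Centre",
    1: "GlaxoSmithKline",
    2: "Hauptman-Woodward Medical Research Institute",
    3: "Merck & Co.",
    4: "Bristol-Myers Squibb",
}


def describe(label_id, source_id):
    """Return a human-readable string for a (label, source) ID pair."""
    return f"{LABELS[label_id]} (source: {SOURCES[source_id]})"


if __name__ == "__main__":
    print(describe(1, 2))
```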

The MARCO data set consists of train and test data sets sharded into files in TFRecords format. TFRecords is the recommended file format for TensorFlow and stores tf.train.Example protocol buffers (which contain Features as a field). Each TFRecords file contains ~1024 records, and each record is a serialized Example proto. The Example proto contains the following fields:

image/height: integer, image height in pixels
image/width: integer, image width in pixels
image/colorspace: string, specifying the colorspace, always 'RGB'
image/channels: integer, specifying the number of channels, always 3
image/class/label: integer, specifying the index in a normalized classification layer
image/class/raw: integer, specifying the index in the raw (original) classification layer
image/class/source: integer, specifying the index of the source (creator of the image)
image/class/text: string, specifying the human-readable version of the normalized label
image/format: string, specifying the format, always 'JPEG'
image/filename: string containing the basename of the image file
image/id: integer, specifying the unique id for the image
image/encoded: string, containing JPEG encoded image in RGB colorspace
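
The field list above translates fairly directly into a TensorFlow parsing spec. The following is a sketch under a few assumptions: every field is present as a single value in every record, the files are the JPEG variant of the dataset, and file_pattern is a hypothetical glob that you adjust to your own directory layout. MARCO_FIELDS, build_feature_spec and load_dataset are illustrative names. TensorFlow is imported lazily, so the field table itself can be used without it.

```python
# Field name -> kind, transcribed from the field list in this document.
# Integers are stored as int64 features, strings as bytes features.
MARCO_FIELDS = {
    "image/height": "int",
    "image/width": "int",
    "image/colorspace": "bytes",
    "image/channels": "int",
    "image/class/label": "int",
    "image/class/raw": "int",
    "image/class/source": "int",
    "image/class/text": "bytes",
    "image/format": "bytes",
    "image/filename": "bytes",
    "image/id": "int",
    "image/encoded": "bytes",
}


def build_feature_spec():
    """Build a feature spec for tf.io.parse_single_example."""
    import tensorflow as tf  # lazy import; requires TensorFlow

    kinds = {"int": tf.int64, "bytes": tf.string}
    # Assumption: each field holds exactly one scalar value per record.
    return {
        name: tf.io.FixedLenFeature([], kinds[kind])
        for name, kind in MARCO_FIELDS.items()
    }


def load_dataset(file_pattern):
    """Return a tf.data.Dataset of (image, label) pairs.

    file_pattern is a hypothetical glob, e.g. "train/*".
    """
    import tensorflow as tf

    spec = build_feature_spec()

    def parse(serialized):
        example = tf.io.parse_single_example(serialized, spec)
        # Assumes JPEG-encoded bytes, as in the JPEG variant of the data.
        image = tf.io.decode_jpeg(example["image/encoded"], channels=3)
        return image, example["image/class/label"]

    files = tf.data.Dataset.list_files(file_pattern)
    return tf.data.TFRecordDataset(files).map(parse)
```

For files in the original (non-JPEG) variant, tf.io.decode_image may be a safer choice than tf.io.decode_jpeg, since it detects the encoding from the bytes.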

Here's the summary for each dataset:

-------------
Train
-------------
Total: 415775
Label: 
    - Other: 23299
    - Clear: 139865
    - Precipitate: 199567
    - Crystals: 53044
Source: 
    - C3: 13111
    - GSK: 70248
    - HWI: 67225
    - Merck: 257889
    - BMS: 7302
Format: 
    - PNG: 67225
    - JPEG: 348550

-------------
Test
-------------
Total: 47029
Label: 
    - Clear: 15659
    - Precipitate: 22360
    - Other: 2810
    - Crystals: 6200
Source: 
    - C3: 1497
    - GSK: 7870
    - HWI: 7698
    - Merck: 29153
    - BMS: 811
Format: 
    - JPEG: 39331
    - PNG: 7698
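
The label counts in the summary show that the classes are unbalanced, which matters when choosing metrics or class weights. As a small worked example, the sketch below recomputes the totals and the per-label fractions from the counts listed above; the names TRAIN_LABELS, TEST_LABELS and fractions are illustrative.

```python
# Label counts copied from the Train and Test summaries in this document.
TRAIN_LABELS = {"Clear": 139865, "Crystals": 53044, "Other": 23299, "Precipitate": 199567}
TEST_LABELS = {"Clear": 15659, "Crystals": 6200, "Other": 2810, "Precipitate": 22360}


def fractions(counts):
    """Return each label's share of the total as a float in [0, 1]."""
    total = sum(counts.values())
    return {name: n / total for name, n in counts.items()}


if __name__ == "__main__":
    for split, counts in (("Train", TRAIN_LABELS), ("Test", TEST_LABELS)):
        print(split, sum(counts.values()))
        for name, frac in fractions(counts).items():
            print(f"  {name}: {frac:.1%}")
```

In the training split, Precipitate accounts for roughly 48% of the images and Crystals for roughly 13%.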
