Data Sets

CRISM Data Set

This data set contains 26,500 spectra, each described by reflectance values at 227 wavelengths from 1.1 to 2.6 μm. The spectra were obtained from a CRISM image of Nili Fossae on Mars (id: frt00003e12). Of the 26,500 spectra, 17 contain evidence of magnesite, a carbonate. The discovery problem posed by this data set is: how quickly is magnesite discovered, given no prior knowledge?
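One way to score the discovery problem is to count how many spectra a method selects before it first selects a magnesite spectrum. The sketch below is only an illustration of that idea. The function name, the toy labels, and the selection order are hypothetical, and it assumes the expert labels are available as integers with magnesite encoded as 1.

```python
def selections_until_discovery(selection_order, labels, target=1):
    """Count selections up to and including the first item labeled `target`.

    selection_order: indices of spectra in the order a method selected them.
    labels: expert label for each spectrum (assumed integer-coded).
    Returns None if no selected item carries the target label.
    """
    for n_selected, idx in enumerate(selection_order, start=1):
        if labels[idx] == target:
            return n_selected
    return None


# Toy example: six spectra, only index 4 is labeled magnesite (1).
toy_labels = [0, 4, 0, 2, 1, 0]
toy_order = [3, 0, 4, 1, 2, 5]  # hypothetical order produced by some method
print(selections_until_discovery(toy_order, toy_labels))
```

A smaller count means magnesite was discovered sooner. A method that never selects a magnesite spectrum returns None.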

To reduce noise and data set size, without losing important details, we first performed a superpixel segmentation on the original pixels to obtain 26,500 homogeneous segments, each represented by its mean spectrum, and then used a median filter to remove shot noise.
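The two preprocessing steps can be sketched as follows. This is a minimal illustration, not the code used to produce the data set: the window size, the choice to filter along the wavelength axis, and the edge handling are all assumptions, and the superpixel segmentation itself is not shown (only the per-segment averaging).

```python
from statistics import median


def mean_spectrum(spectra):
    """Average a group of equal-length spectra band by band (one segment)."""
    return [sum(band) / len(spectra) for band in zip(*spectra)]


def median_filter(spectrum, window=3):
    """Replace each value with the median of a centered window.

    At the edges the window is truncated to the neighbors that exist.
    The window size of 3 is an assumption made for this sketch.
    """
    if window < 1 or window % 2 == 0:
        raise ValueError("window must be a positive odd integer")
    half = window // 2
    n = len(spectrum)
    return [
        median(spectrum[max(0, i - half):min(n, i + half + 1)])
        for i in range(n)
    ]


# Toy spectrum with a single shot-noise spike at index 2.
noisy = [0.20, 0.21, 0.95, 0.22, 0.23]
print(median_filter(noisy))
```

A median filter suits isolated spikes because one extreme value in the window does not move the median, whereas a mean filter would smear the spike into its neighbors.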

The data set is provided as a .mat file. Once loaded into Matlab, the following variables will be available:

  • crism_data: CRISM data matrix (26,500 spectra with 227 features). The data is sorted in decreasing order of reconstruction error, obtained by applying PCA to the data and reconstructing each spectrum from only the first principal component.
  • crism_expert: Expert-provided labels for each spectrum. Magnesite has a label of "1". Other labels include olivine ("2"), phyllosilicate ("3"), and background (uninteresting) rock ("4"). Note that the majority of the image is not labeled ("0") and, with the exception of magnesite, class labels should be taken as representative and not comprehensive.
  • wl_microns: The wavelength (in microns) for each of the 227 features.
  • K: The number of principal components (K) required to capture 90% of the variance in the data set.
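The sort order of the data matrix and the meaning of K can be illustrated with a short NumPy sketch. It is an approximation under stated assumptions: the data is mean-centered before PCA, the reconstruction error is the squared Euclidean distance, and the function names are hypothetical. The page does not specify these details, so the output may not exactly reproduce the ordering or the K stored in the file.

```python
import numpy as np


def sort_by_reconstruction_error(X, n_components=1):
    """Return (order, errors) for the rows of X.

    order: row indices sorted by decreasing PCA reconstruction error.
    errors: squared reconstruction error of each row, in original row order.
    """
    X = np.asarray(X, dtype=float)
    Xc = X - X.mean(axis=0)  # assumption: mean-center before PCA
    # Rows of Vt are principal directions, ordered by decreasing variance.
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    V = Vt[:n_components].T
    recon = Xc @ V @ V.T  # project onto the leading components and back
    errors = np.sum((Xc - recon) ** 2, axis=1)
    order = np.argsort(-errors, kind="stable")
    return order, errors


def components_for_variance(X, fraction=0.90):
    """Smallest K whose leading components capture `fraction` of the variance."""
    X = np.asarray(X, dtype=float)
    Xc = X - X.mean(axis=0)
    s = np.linalg.svd(Xc, compute_uv=False)
    explained = np.cumsum(s ** 2) / np.sum(s ** 2)
    k = int(np.searchsorted(explained, fraction)) + 1
    return min(k, len(s))  # guard against rounding just below 1.0


# Toy matrix: most variance lies along the first axis.
toy = np.array([[2.0, 0.0], [-2.0, 0.0], [0.0, 1.0], [0.0, -1.0]])
order, errors = sort_by_reconstruction_error(toy)
print(order, components_for_variance(toy))
```

Outside Matlab, the variables might be read with scipy.io.loadmat, which returns a dictionary keyed by variable name. This assumes the file is not in the v7.3 (HDF5) format, which that function does not read.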

Download: crism_data_sort-sc1-new.mat (23 MB)

Attribution: If you use this data set in your own work, please cite the CRISM scene id (frt00003e12). If you use the labels, please cite this paper:

Kiri L. Wagstaff, Nina L. Lanza, David R. Thompson, Thomas G. Dietterich, and Martha S. Gilmore. "Guiding scientific discovery with explanations using DEMUD." Proceedings of the Twenty-Seventh AAAI Conference on Artificial Intelligence (AAAI-13), 2013.

Note: Due to a file server crash, this data set differs slightly from the one described in our AAAI 2013 paper. However, it should be nearly identical and present the same challenge for magnesite discovery.


ChemCam Data Set

This data set contains 110 laboratory spectra, each described by emission values at 6143 wavelengths from 224 to 932 nm. There are eight rock types comprising 22 samples represented by 5 spectra each.

The data set is provided as a .mat file. Once loaded into Matlab, the following variables will be available:

  • libs_data: ChemCam (LIBS) data matrix (110 spectra with 6143 features). The data is sorted in decreasing order of reconstruction error, obtained by applying PCA to the data and reconstructing each spectrum from only the first principal component.
  • sample_type: Sample names for each sample; there are 5 spectra from each of 22 samples included. The 22 samples fall into 8 rock types, as follows:
    • AGV2: andesite
    • BHV02: basalt
    • CalciteRock: calcite
    • DH4909: olivine
    • DH4912: olivine
    • Dolomite: dolomite
    • GBW07105: basalt
    • GBW07108: limestone
    • GBW07114: dolomite
    • GBW07216a: dolomite
    • GBW07217a: dolomite
    • GUWBM: basalt
    • GYPC: gypsum
    • GYPD: gypsum
    • GreenCalcite: calcite
    • JA2: andesite
    • JD01: dolomite
    • OrangeCalcite: calcite
    • Rhodochrosite: rhodochrosite
    • Siderite: siderite
    • SideriteRock: siderite
    • WardsDolomite: dolomite
  • libs_bands: The wavelength (in nm) for each of the 6143 features.
  • K: The number of principal components (K) required to capture 90% of the variance in the data set.
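The sample names listed above can be turned into rock-type labels with a simple lookup table. The sketch below copies the mapping from the list. The function name and the toy sample_type list are hypothetical, and it assumes each entry of sample_type is a string that matches a sample name exactly once surrounding whitespace is removed.

```python
from collections import Counter

# Sample name -> rock type, copied from the list above.
SAMPLE_TO_ROCK = {
    "AGV2": "andesite",
    "BHV02": "basalt",
    "CalciteRock": "calcite",
    "DH4909": "olivine",
    "DH4912": "olivine",
    "Dolomite": "dolomite",
    "GBW07105": "basalt",
    "GBW07108": "limestone",
    "GBW07114": "dolomite",
    "GBW07216a": "dolomite",
    "GBW07217a": "dolomite",
    "GUWBM": "basalt",
    "GYPC": "gypsum",
    "GYPD": "gypsum",
    "GreenCalcite": "calcite",
    "JA2": "andesite",
    "JD01": "dolomite",
    "OrangeCalcite": "calcite",
    "Rhodochrosite": "rhodochrosite",
    "Siderite": "siderite",
    "SideriteRock": "siderite",
    "WardsDolomite": "dolomite",
}


def rock_types_for(sample_names):
    """Map each sample name to its rock type (KeyError on unknown names)."""
    return [SAMPLE_TO_ROCK[name.strip()] for name in sample_names]


# Hypothetical stand-in for sample_type: five spectra per sample.
toy_sample_type = [name for name in SAMPLE_TO_ROCK for _ in range(5)]
counts = Counter(rock_types_for(toy_sample_type))
print(counts["dolomite"])
```

Raising a KeyError on an unknown name is deliberate here: a mismatch between the file's sample names and the table is surfaced immediately instead of being silently mislabeled.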

Download: libs_data_sort-new.mat (3 MB)

Attribution: If you use this data set in your own work, please cite this paper:

Lanza, N.L., Wiens, R.C., Clegg, S.M., Ollila, A.M., Humphries, S.D., Newsom, H.E., Barefield, J.E., and the ChemCam Team. "Calibrating the ChemCam laser-induced breakdown spectroscopy instrument for carbonate minerals on Mars." Applied Optics, 49 (13), C211-C217, 2010.