Upcoming surveys like LSST and Euclid will produce billions of galaxy images but comparatively few spectra, motivating models that learn cross-modal representations. We build GalaxiesML-Spectra, a dataset pairing 134,533 galaxy images (HSC-PDR2) with spectra (DESI-DR1), and adapt a Multi-Modal Masked Autoencoder (MMAE) to embed both images and spectra in a shared representation.
The MMAE is a transformer-based architecture that we train by masking 75% of the input tokens and reconstructing the missing image and spectral tokens. We use the model to test three applications: spectral reconstruction and image reconstruction from heavily masked data, and redshift regression from images alone. It recovers key physical features, such as galaxy shapes, atomic emission-line peaks, and broad continuum slopes, though it struggles with fine image details and line strengths. For redshift regression, the MMAE performs comparably to, or better than, prior multi-modal models in terms of prediction scatter, even when spectra are missing at test time. These results highlight both the potential and limitations of masked autoencoders in astrophysics and motivate extensions to additional modalities, such as text, for foundation models.
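To make the masking-and-reconstruction step concrete, the sketch below shows a minimal joint masked autoencoder over image and spectrum tokens in PyTorch. Token counts, embedding sizes, depths, and module names are illustrative assumptions, not the configuration used in the paper.

```python
# Minimal sketch of 75% random masking over joint image + spectrum tokens.
# Shapes, dimensions, and module names are illustrative, not the paper's setup.
import torch
import torch.nn as nn


class ToyMMAE(nn.Module):
    def __init__(self, img_patches=64, spec_patches=64, dim=128, mask_ratio=0.75):
        super().__init__()
        self.mask_ratio = mask_ratio
        n_tokens = img_patches + spec_patches
        self.pos_emb = nn.Parameter(torch.zeros(1, n_tokens, dim))
        self.mask_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True), num_layers=4
        )
        self.decoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True), num_layers=2
        )
        self.head = nn.Linear(dim, dim)  # predicts the original token embedding

    def forward(self, tokens):
        # tokens: (batch, n_tokens, dim) -- image patches and spectrum segments
        # already embedded into a shared token space.
        B, N, D = tokens.shape
        x = tokens + self.pos_emb

        # Randomly keep 25% of the tokens; the rest are hidden from the encoder.
        n_keep = int(N * (1 - self.mask_ratio))
        noise = torch.rand(B, N, device=x.device)
        ids_keep = noise.argsort(dim=1)[:, :n_keep]
        visible = torch.gather(x, 1, ids_keep.unsqueeze(-1).expand(-1, -1, D))

        latent = self.encoder(visible)

        # Re-insert learned mask tokens at the masked positions for decoding.
        full = self.mask_token.expand(B, N, D).clone()
        full.scatter_(1, ids_keep.unsqueeze(-1).expand(-1, -1, D), latent)
        recon = self.head(self.decoder(full + self.pos_emb))

        # Reconstruction loss is computed only on the masked positions.
        mask = torch.ones(B, N, device=x.device)
        mask.scatter_(1, ids_keep, 0.0)
        loss = ((recon - tokens) ** 2).mean(dim=-1)
        return (loss * mask).sum() / mask.sum()
```

In the actual model the reconstruction targets would be the raw image patches and spectral segments rather than the token embeddings; the sketch uses embeddings only to stay compact.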
Our paper submission to the NeurIPS 2025 Machine Learning for the Physical Sciences Workshop details the dataset construction, model architecture, and test results.
View Paper on arXiv (PDF)

We assembled a multi-modal dataset, referred to as GalaxiesML-Spectra, of 134,533 galaxies, each with a 5-band image, a 1D spectrum, and a spectroscopic redshift. See the dataset access link for detailed information.
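As a rough illustration of how such a dataset might be read, the sketch below loads images, spectra, and redshifts from an HDF5 file. The file name and dataset keys are assumptions made for illustration; the actual layout is documented at the dataset access link.

```python
# Hedged sketch: load one resolution split, assuming an HDF5 layout with
# 'image' (N, 5, 64, 64), 'spectrum' (N, L), and 'redshift' (N,) datasets.
# File name and keys are hypothetical; consult the dataset documentation.
import h5py
import numpy as np


def load_split(path="galaxiesml_spectra_64x64.h5"):  # hypothetical file name
    with h5py.File(path, "r") as f:
        images = f["image"][:]       # 5-band HSC cutouts
        spectra = f["spectrum"][:]   # 1D DESI spectra
        redshift = f["redshift"][:]  # spectroscopic redshifts
    assert images.shape[0] == spectra.shape[0] == redshift.shape[0]
    return images.astype(np.float32), spectra.astype(np.float32), redshift
```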
GalaxiesML-Spectra is a crossmatch between DESI spectra and HSC imaging. The dataset is provided at two image resolutions, 127x127 and 64x64 pixels. When using the dataset, please cite:
Fig 1: The model's architecture and reconstruction process are shown for a low-redshift source with 75% masking of both modalities. We measure the peak location, amplitude, and width of H-alpha in the augmented and generated spectra. The H-alpha line has an observed center at 7042.8 Å with a height of 3.04 and a width of 34.5 Å, while the model reconstructed it at 7066.8 Å with a height of 0.62 and a width of 528 Å.
Fig 2: The model's reconstruction process is shown for a high-redshift source with a fully masked spectrum and a fully unmasked image. We measure the peak location, amplitude, and width of Lyman-alpha and C IV in the augmented and generated spectra. The Lyman-alpha line has an observed center at 3851.6 Å with a height of 17.24 and a width of 48 Å, compared to a reconstructed center at 3923.6 Å, height 5.84, and width 312 Å. Similarly, the C IV line has an observed center at 4907.6 Å, height 7.07, and width 72 Å, while the reconstructed line is at 4931.6 Å, with height 2.48 and width 648 Å.
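The line measurements quoted in Figs. 1 and 2 (peak location, amplitude, and width) can be obtained with a Gaussian-plus-continuum fit around the expected observed wavelength. The sketch below shows one generic way to do this with scipy; it is not necessarily the exact fitting procedure used in the paper, and the 100 Å window and initial guesses are assumptions.

```python
# Generic emission-line measurement: fit a Gaussian plus constant continuum
# in a window around the expected observed center, e.g. H-alpha at
# 6562.8 * (1 + z) Angstrom for a source at redshift z.
import numpy as np
from scipy.optimize import curve_fit


def gaussian_plus_continuum(wave, amp, center, sigma, cont):
    return amp * np.exp(-0.5 * ((wave - center) / sigma) ** 2) + cont


def measure_line(wave, flux, expected_center, window=100.0):
    sel = np.abs(wave - expected_center) < window
    w, f = wave[sel], flux[sel]
    p0 = [f.max() - np.median(f), expected_center, 5.0, np.median(f)]  # initial guess
    popt, _ = curve_fit(gaussian_plus_continuum, w, f, p0=p0)
    amp, center, sigma, _ = popt
    fwhm = 2.3548 * abs(sigma)  # convert Gaussian sigma to FWHM
    return center, amp, fwhm
```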
Fig 3: The model's redshift regression results for the entire redshift range are shown (right). The redshift predictions were obtained from test data that had 25% of the image masked and 100% of the spectrum masked. The low-redshift regime used for comparison to AstroCLIP is shown in more detail in the bottom left. The top left panel shows the scatter of the MMAE compared to AstroCLIP and a BCNN model for this low-redshift regime. Lower scatter corresponds to more precise predictions.
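For context, a common scatter statistic in redshift-regression comparisons is the normalized median absolute deviation of (z_pred - z_true)/(1 + z_true); a minimal sketch is given below. The paper's exact scatter definition may differ.

```python
# Normalized median absolute deviation (sigma_NMAD), a standard measure of
# redshift prediction scatter; smaller values mean more precise predictions.
import numpy as np


def sigma_nmad(z_pred, z_true):
    dz = (z_pred - z_true) / (1.0 + z_true)
    return 1.4826 * np.median(np.abs(dz - np.median(dz)))
```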
Python scripts for building and training the model, plotting notebooks, installation requirements, and model predictions are included in the repository.
Access Repository

This poster was presented at the NeurIPS 2025 Machine Learning for the Physical Sciences Workshop on December 6, 2025, in San Diego, California.
This publication is licensed under a Creative Commons Attribution 4.0 International License (CC BY 4.0).