CESPED: a new benchmark for supervised particle pose estimation in Cryo-EM (2024)

Ruben Sanchez-Garcia¹,², Michael Saur², Javier Vargas³, Carl Poelking², Charlotte M Deane¹

¹Department of Statistics, University of Oxford, Oxford OX1 3LB, UK
{ruben.sanchez-garcia, deane}@stats.ox.ac.uk
²Astex Pharmaceuticals, Cambridge CB4 0QA, UK
{michael.saur, carl.poelking}@astx.com
³Departamento de Optica, Universidad Complutense de Madrid, Madrid 28040, Spain
jvargas@fis.ucm.es

Abstract

Cryo-EM is a powerful tool for understanding macromolecular structures, yet current methods for structure reconstruction are slow and computationally demanding. To accelerate research on pose estimation, we present CESPED, a new dataset specifically designed for Supervised Pose Estimation in Cryo-EM. Alongside CESPED, we provide a PyTorch package to simplify Cryo-EM data handling and model evaluation. We evaluated the performance of a baseline model, Image2Sphere, on CESPED, which showed promising results but also highlighted the need for further improvements. Additionally, we illustrate the potential of deep learning-based pose estimators to generalise across different samples, suggesting a promising path toward more efficient processing strategies. CESPED is available at https://github.com/oxpig/cesped.

1 Introduction

1.1 Cryo-EM Single Particle Analysis

Determining the structure of macromolecules is crucial to deciphering the intricacies of biological processes and the underlying mechanisms of diseases. With the advent of the resolution revolution, Cryogenic Electron Microscopy (Cryo-EM) has emerged as a leading technique for elucidating structures [36, 6]. This revolution, driven by significant advances in direct electron detectors and image processing algorithms, has made Cryo-EM a routine, often unrivaled, method for many complex samples [12]. Its advantages include, among others, the relative ease of sample preparation compared to other techniques (e.g., x-ray crystallography), the capability to analyze protein complexes previously considered out of reach, and the ability to recover different conformations, offering a dynamic view of molecules in action [37]. The pivotal role of Cryo-EM in structural biology was globally recognised in 2017 when the technique was awarded the Nobel Prize in Chemistry.

The primary aim of Cryo-EM Single-Particle Analysis (SPA) is to reconstruct the three-dimensional (3D) structure of a given macromolecule at near-atomic resolution, ideally better than 3 Å. This process uses electron beams to capture thousands of two-dimensional (2D) images of the macromolecules, which are flash frozen in vitreous ice to preserve their native state without the distortions typical of crystalline ice or other fixation methods [38]. Each image, called a micrograph, can display several hundred snapshots of the macromolecule (referred to as particle images or just particles) in unknown random orientations. If the orientations of these images were known, the reconstruction task would closely resemble the algorithms used in tomography, which reconstruct 3D volumes from 2D projections taken at predetermined angles [18]. However, the unknown orientations of the particles in SPA present a unique challenge not encountered in tomography [3]. Compounding this challenge is the inherently low contrast and extremely poor signal-to-noise ratio (SNR) of the images, a consequence of the delicate biological nature of the samples. Given these challenges, a highly sophisticated image processing pipeline is essential to accurately resolve the 3D structure of the macromolecule [32, 50].

The fundamental principle of image processing in SPA is grounded in the intuitive strategy of employing averaging to mitigate noise. Since images are characterised by a low SNR, averaging multiple images of the same particle, assumed to be identical, can significantly enhance the underlying signal [37]. However, before averaging can be effectively carried out, each particle projection must be aligned to a common orientation. This ensures that the differences observed across the images are solely due to noise, allowing its effective cancellation during the averaging process.

The standard Cryo-EM image processing pipeline encompasses several key steps, beginning with various preprocessing operations to correct errors such as beam-induced movement blur, followed by particle picking, which extracts the individual particle images from the micrographs [32, 50]. Subsequent stages include, among others, clustering (commonly referred to as 2D classification in the context of Cryo-EM) and particle alignment against references, leading to a cleaner subset of the data and an initial low-resolution 3D volume of the protein. This preparatory work sets the stage for the refinement step, a critical phase where the poses of the particles are precisely estimated, a requirement to achieve the high-resolution volumes needed to reveal atomic level details.

Traditional refinement algorithms perform pose estimation by exhaustive comparison of experimental particle images and simulated projections of 3D volumes that are iteratively improved [7, 16, 43, 46, 49]. When sample homogeneity can be assumed, the simplest approach to the pose estimation problem is the projection matching algorithm [41], which consists of $T$ iterations of two steps: alignment and reconstruction. First, in the alignment phase, the pose $(R,s)_i \in \mathrm{SO(3)} \times \mathbb{R}^2$ of each experimental particle image $x_i$ is set to be the same as that of the most similar 2D projection of the reference volume $V^t$ at iteration $t$,

$$(R,s)_i = \arg\min_{(R,s)\in\mathrm{SO(3)}\times\mathbb{R}^2} \left\| x_i - f_i \ast P_{(R,s)} V^t \right\|^2 \quad (1)$$

where $P_{(R,s)}$ is the projection operator, $f_i$ is the point spread function of the microscope for the $i$-th particle, and $\ast$ the convolution operator. Then, in the reconstruction phase, a new volume (in reciprocal space) is computed from the estimated poses as

$$\hat{V}^{(t+1)} = \frac{\sum_{i=1}^{N} P_{(R,s)_i}^{-1}\, \hat{f}_i\, \hat{x}_i}{\sum_{i=1}^{N} P_{(R,s)_i}^{-1}\, \hat{f}_i^{2} + C_i} \quad (2)$$

with $\hat{V}$ being the Fourier transform of the volume $V$, $C_i$ a constant depending on the SNR, $\hat{f}_i$ the Fourier transform of the point spread function (the contrast transfer function, CTF), and $N$ the number of particles. This iterative process continues until convergence. State-of-the-art methods build on this approach: Relion [46] employs a Bayesian probabilistic model with a prior on the map, making it much more robust, while CryoSPARC [43] accelerates Bayesian methods through branch-and-bound search and gradient descent optimisation. See [3] for a review.
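
As an illustration of the alignment step in Eq. (1), the sketch below scores one experimental image against a precomputed gallery of reference projections (each modulated by the particle's point spread function) and keeps the best-matching pose. It is a minimal NumPy sketch with illustrative names; gallery generation and the reconstruction step of Eq. (2) are omitted.

```python
import numpy as np

def align_particle(exp_image, proj_gallery, poses, psf):
    """Alignment step of projection matching (Eq. 1).

    exp_image    : (H, W) experimental particle image.
    proj_gallery : (M, H, W) projections of the current volume V^t, one per candidate pose.
    poses        : sequence of M candidate poses (rotation + shift), one per projection.
    psf          : (H, W) point spread function of the microscope for this particle.
    Returns the pose of the gallery projection closest in squared L2 distance.
    """
    # Convolve every reference projection with the particle's PSF
    # (i.e. multiply by the CTF in Fourier space).
    ctf = np.fft.fft2(np.fft.ifftshift(psf))
    blurred = np.fft.ifft2(np.fft.fft2(proj_gallery, axes=(-2, -1)) * ctf,
                           axes=(-2, -1)).real

    # Squared L2 distance between the experimental image and each candidate projection.
    errors = ((blurred - exp_image[None]) ** 2).sum(axis=(-2, -1))
    best = int(np.argmin(errors))
    return poses[best], errors[best]
```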

Despite the innovations aimed at enhancing efficiency, the refinement process still poses significant computational challenges. The primary factor contributing to these challenges is the large number of image comparisons required for each experimental image. Furthermore, the iterative refinement of the volumes, beginning with an initial low-resolution model and progressively improving it, further increases the computational cost, making the refinement stage the most computationally intensive step in the Cryo-EM workflow.

1.2 Deep Learning for Pose Estimation in Real-World Objects

Similar to refinement algorithms in Cryo-EM, traditional pose estimation techniques for real-world images primarily focus on matching 2D images with 3D objects. The significantly higher SNRs characteristic of real-world images enable the use of more sophisticated and efficient methods beyond simple template matching. Among these, landmark-based registration methods are particularly prevalent. Such methods involve extracting distinctive landmarks through various feature extraction techniques [30, 2], followed by a registration process to identify the relative orientation of the landmarks in the image with respect to the reference landmarks [5] .

PoseNet [23] was a groundbreaking development in this field, leveraging a Convolutional Neural Network (CNN) to directly regress the absolute pose of an object using quaternions and $xyz$ shifts. This direct approach contrasts with earlier techniques that relied heavily on feature extraction and landmark identification, allowing for end-to-end pose estimation. Subsequent innovations have built on the foundation laid by PoseNet: improvements in network architectures [33], more sophisticated loss functions [22], and the incorporation of multitask learning [52] have all contributed to significant gains in pose estimation performance.

Addressing the inherent challenges of symmetry and occlusion in pose estimation has also seen considerable progress through deep learning. Strategies have evolved from breaking symmetry during the data labelling process [51] to implementing loss functions specifically designed to accommodate known symmetries [52]. Probabilistic models offer alternative approaches that either classify poses within a discretised space or explicitly learn the parameters of probability distributions [8, 31, 34, 35, 42]. Due to their probabilistic nature, these models are better suited for challenging datasets with high levels of ambiguity or noise.

1.3 Deep learning methods for Cryo-EM structure determination or pose estimation

While traditional Cryo-EM refinement algorithms tend to be relatively robust and accurate, they are computationally intensive and slow. In an attempt to overcome this, deep learning (DL) alternatives have begun to emerge.

Unsupervised DL methods aim to determine the 3D structure of macromolecules from experimental images alone. Some of them tackle the problem using a distance learning approach in which the angular distance between pairs of images is estimated as a preprocessing step to retrieve their relative poses [1]. Other unsupervised DL methods mirror traditional techniques by maintaining a 3D volume representation to compute 2D projections in a differentiable manner [11]. Unlike traditional refinement methods, which compare each experimental particle against all images in an SO(3) projection gallery with up to millions of members, these methods try to limit the number of comparisons between experimental images and projections. For instance, in CryoGAN a 3D volume, randomly initialised, serves as the generator in the Generative Adversarial Network (GAN) framework [14]. This generator produces a set of projections from random orientations that are then fed to a discriminator network along with real experimental images. The objective of the training process is to refine the generator until the discriminator can no longer distinguish between the generated projections and the actual experimental images, effectively capturing the underlying 3D structure present in the experimental data. In some other approaches [27, 26] , particle images are first processed by an encoder designed to predict particle orientations. Following this prediction, a projection of the representation of the volume corresponding to the inferred orientation is rendered. This projection is then directly compared to the original experimental particle image. A loss function is utilized to concurrently refine both the encoder’s parameters and the representation of the volume, improving the accuracy of orientation predictions and the fidelity of the reconstructed volume.

Supervised DL models, on the other hand, are trained using experimental images and some form of (possibly noisy) labels, such as the poses of prealigned sets of particles. The simplest alternative consists of only an encoder module that predicts the orientation of the particle directly from its image [21, 28]. Although supervised approaches offer remarkable efficiency and speed, they require labelled data for training, thus limiting their applicability in de novo situations. However, there are use cases where supervised DL methods could offer an advantage. For instance, it should be possible to apply them to on-the-fly pipelines in which a first batch of particles is pre-aligned before the end of the data stream. This initial alignment could be used to train a supervised model to be applied to subsequent batches of data, inferring their poses in real time. Even more interestingly, a pre-trained supervised model could be used to infer poses in different projects, provided the new samples are similar to the training data. This second use case relies on the fact that pose estimation in classical methods is mainly driven by low- to mid-resolution frequencies [47]. As similar proteins have similar low- to mid-resolution frequencies, trained models are expected to generalise to these new samples. In addition, because ligand binding does not generally modify the overall shape of proteins, supervised approaches can be especially valuable in drug discovery, where pre-aligned data for target proteins is often available.

In the context of Cryo-EM, only two supervised methods have been proposed to perform direct pose estimation given prealigned particles: DeepAlign [21], a set of CNNs that perform binary classification over a discretisation of $S^2$, and the approach of Lian et al. [28], who implemented a CNN to perform direct regression of quaternions. However, due to its limitations, especially for symmetric data, Lian et al. finally adopted a hybrid model with a projector as in some unsupervised Cryo-EM estimators. Neither of the two methods has been used in practical scenarios.

While much slower, classical refinement methods still outperform DL pose estimation models in terms of accuracy and reliability. This gap can be partly attributed to the unique characteristics of Cryo-EM data, which differ from the natural images DL architectures were designed for, and, importantly, to the lack of a standardized benchmark that would allow a direct comparison of methods and stimulate progress, much as ImageNet [9] did for image classification. In this paper, we introduce CESPED (Cryo-EM Supervised Pose Estimation Dataset), a benchmark specifically designed to evaluate supervised pose estimation methods. As the first benchmark dedicated to pose estimation in Cryo-EM, CESPED addresses a crucial gap in the array of available datasets, which have, until now, primarily focused on other Cryo-EM challenges, such as model building [13] and particle picking [17, 10]. CESPED aims to foster advancements in DL methods for particle processing by promoting improvements in supervised pose estimation models, which, due to shared architectural building blocks and data challenges, are likely to benefit methods for related tasks as well.

1.4 Main contributions

In this study, we provide an accessible entry point for a wider scientific audience to engage with the challenges of SPA in Cryo-EM. Toward this goal:

  • We compile CESPED (Cryo-EM Supervised Pose Estimation Dataset), an easy-to-use benchmark specifically designed for Supervised Pose Estimation in Cryo-EM.

  • We implement a PyTorch-based [39] package to handle Cryo-EM particle data and to easily compute Cryo-EM quality metrics.

  • We train and evaluate the Image2Sphere model [25], originally developed for real-world pose estimation, on our benchmark, illustrating the utility of our benchmark and shedding light on the transferability of real-world pose estimation models to the Cryo-EM domain.

  • We present a use case demonstrating that deep learning-based supervised pose estimators have the potential to generalise across related but different samples.

2 Methods

2.1 Benchmark Compilation and preprocessing

In our effort to build a comprehensive benchmark, our primary goal was to identify a diverse set of EMPIAR entries containing at least 200,000 particles, a number deemed sufficient for effective model training. Due to the limitations of EMPIAR’s search functionality and the inconsistencies in dataset annotations, we conducted a manual search for entries exceeding this particle count and containing standard Relion files (.star and .mrcs). Subsequently, we verified the consistency and accuracy of the metadata by running relion_reconstruct [46] and visually assessing the resulting volumes. This step was crucial for eliminating a significant number of entries with metadata issues that either crashed the reconstruction process or led to incorrect volumes. To ensure consistent estimation of particle poses, the data was re-processed using the Relion version 4 auto-refine program [46, 24] (see Appendix A for details). Only entries for which the reconstructed volume exhibited resolution values close to those reported in the literature were selected for inclusion in the benchmark. Finally, for consistency, all images were downsampled to 1.5 Å/pixel, with different image dimensions in each entry as macromolecules vary in size. See Appendix A for a list of the entries and their properties.

The images fed to the deep learning model were preprocessed on-the-fly. We performed per-image normalisation following the standard Cryo-EM procedure, which involves rescaling the intensity so that the background (noise) has a mean of 0 and a standard deviation of 1. We also corrected the contrast inversion caused by the defocus via phase flipping [40]. Finally, since the macromolecule typically occupies only between 25% and 50% of the whole particle image, the images were cropped so that neighbouring particles are not included. It is important to note that our benchmark package allows users the flexibility to choose whether or not to apply any of these normalisation steps.
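
A minimal sketch of the per-image normalisation described above is shown below. Here the background statistics are estimated from the pixels outside a central circular mask of an assumed particle radius; the exact masking strategy used in the cesped package may differ.

```python
import numpy as np

def normalise_particle(img, particle_radius_px):
    """Rescale intensities so the background noise has mean 0 and std 1.

    The background is estimated from pixels outside a circular mask of radius
    `particle_radius_px` centred on the image (an illustrative assumption;
    other background definitions are possible).
    """
    h, w = img.shape
    yy, xx = np.mgrid[:h, :w]
    r2 = (yy - h / 2) ** 2 + (xx - w / 2) ** 2
    background = img[r2 > particle_radius_px ** 2]
    return (img - background.mean()) / background.std()
```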

The data labels are represented as rotation matrices and then converted into grid indices by finding the closest rotation matrix in the $\mathrm{SO(3)_{grid}}$. For the cases in which the macromolecule exhibits point symmetry, the labels are expanded as $L_i = \{\, g_j R_i \mid g_j \in G \,\}$, where $G$ is the set of rotation matrices of a given point symmetry group (e.g., $C1$), and $R_i$ is the ground truth rotation matrix. As a result, the labels consist of vectors with $|G|$ non-zero values and $|\mathrm{SO(3)_{grid}}| - |G|$ zeros.
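
The label construction can be sketched as follows: each symmetry-equivalent copy of the ground truth rotation is assigned to the index of the nearest grid rotation by geodesic distance. The SO(3) grid and the symmetry-group matrices are taken as given, and the names are illustrative.

```python
import numpy as np

def rotation_to_label(R_true, so3_grid, sym_group):
    """Build the multi-hot label vector for one particle.

    R_true    : (3, 3) ground truth rotation matrix.
    so3_grid  : (K, 3, 3) discretised SO(3) grid of rotation matrices.
    sym_group : (|G|, 3, 3) rotation matrices of the point symmetry group
                (for C1 this is just the identity).
    Returns a (K,) vector with |G| non-zero entries summing to 1
    (assuming each symmetry copy maps to a distinct grid cell).
    """
    label = np.zeros(len(so3_grid))
    for g in sym_group:
        R_equiv = g @ R_true  # symmetry-equivalent ground truth pose
        # Geodesic distance to every grid rotation: arccos((tr(R_equiv R_k^T) - 1) / 2)
        traces = np.einsum('ij,kij->k', R_equiv, so3_grid)
        dists = np.arccos(np.clip((traces - 1) / 2, -1.0, 1.0))
        label[int(np.argmin(dists))] = 1 / len(sym_group)
    return label
```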

2.2 Baseline model

We adapted the state-of-the-art Image2Sphere model [25]. Image2Sphere is a hybrid architecture that uses a ResNet to produce a 2D feature map of the input image, which is then orthographically projected onto a 3D hemisphere and expanded in spherical harmonics. Then, equivariant group convolutions are applied, first with global support on the $S^2$ sphere, and finally, as a refinement step, on SO(3). The output of the model is a probability distribution over a discretised grid of rotation matrices. Other supervised Cryo-EM methods for pose estimation were not considered for this work due to the lack of publicly available code [28] or their GUI requirements [21].

2.3 Evaluation metrics

The most widely used metric in pose estimation is the mean angular error ($\mathrm{MAnE}$), averaged across all poses,

$$\mathrm{MAnE} = \frac{1}{N}\sum_{i=1}^{N} \mathrm{angError}_i \quad (3)$$

The angular error ($\mathrm{angError}$) measures the geodesic distance between the predicted and ground truth poses, typically expressed in degrees or radians. This distance can be directly calculated from the rotation matrix of the ground truth pose $\mathrm{trueR}_i$ and the predicted rotation matrix $\mathrm{predR}_i$ as

$$\mathrm{angError}_i = \arccos\left(\frac{\mathrm{trace}(\mathrm{trueR}_i \cdot \mathrm{predR}_i^{T}) - 1}{2}\right) \quad (4)$$

When evaluating predicted orientations of macromolecules exhibiting point symmetry, it is necessary to adjust the angular error, as several rotation matrices become equivalent. In this context, the angular error is defined as the minimum geodesic distance between the predicted orientation and any orientation equivalent to the ground truth under the molecule’s symmetry group

$$\mathrm{angError}_i = \min_{g_j \in G} \arccos\left(\frac{\mathrm{trace}(g_j \cdot \mathrm{trueR}_i \cdot \mathrm{predR}_i^{T}) - 1}{2}\right) \quad (5)$$

with $G$ being the set of rotation matrices of the given point symmetry group.
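
Equations (4) and (5) can be implemented directly from the rotation matrices. A short sketch, with the symmetry group supplied as a stack of rotation matrices:

```python
import numpy as np

def angular_error_deg(R_true, R_pred, sym_group=None):
    """Geodesic distance (Eqs. 4-5) in degrees between two rotation matrices.

    If `sym_group` (a (|G|, 3, 3) stack of symmetry rotations) is given, the
    minimum error over all symmetry-equivalent ground truths is returned.
    """
    if sym_group is None:
        sym_group = np.eye(3)[None]  # C1: identity only
    candidates = sym_group @ R_true  # (|G|, 3, 3) symmetry-equivalent ground truths
    # trace(g . R_true . R_pred^T) for every symmetry element g
    traces = np.einsum('gij,ij->g', candidates, R_pred)
    angles = np.arccos(np.clip((traces - 1) / 2, -1.0, 1.0))
    return np.degrees(angles.min())
```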

However, due to the uncertainty in the estimated poses [21], we propose additional metrics. The first one is the confidence-weighted mean-angular-error,

$$\mathrm{wMAnE} = \frac{\sum_{i=1}^{N} \mathrm{conf}_i \cdot \mathrm{angError}_i}{\sum_{i=1}^{N} \mathrm{conf}_i} \quad (6)$$

which weights the $\mathrm{angError}_i$ by $\mathrm{conf}_i$, the confidence in the ground truth pose, measured as Relion’s rlnMaxValueProbDistribution. This confidence estimate is a number between 0 and 1 that measures the probability of the particle having the reported ground truth orientation according to the Relion model. While $\mathrm{wMAnE}$ is still sensitive to ground truth and confidence estimation errors, due to its simplicity, we used it as the criterion for hyperparameter tuning.
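
A sketch of the confidence-weighted mean angular error (Eq. 6), using per-particle angular errors (Eqs. 4-5) and the Relion confidences as weights:

```python
import numpy as np

def weighted_mane(ang_errors, confidences):
    """Confidence-weighted mean angular error (Eq. 6).

    ang_errors  : (N,) per-particle angular errors in degrees.
    confidences : (N,) per-particle confidences in [0, 1]
                  (e.g. Relion's rlnMaxValueProbDistribution).
    """
    ang_errors = np.asarray(ang_errors, dtype=float)
    confidences = np.asarray(confidences, dtype=float)
    return float((confidences * ang_errors).sum() / confidences.sum())
```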

The quality of volumes reconstructed from the predicted poses is assessed by comparing them with the ground truth volumes generated from the original poses (see Appendix H). For this comparison, we employ the real space Pearson’s Correlation Coefficient (PCC) and the Fourier Shell Correlation (FSC) Resolution as metrics.

The Pearson’s correlation coefficient is a value between -1 and 1, where values closer to 1 indicate a higher similarity. It measures the linear correlation between the pixels of the two volumes as follows:

$$\mathrm{PCC}(X,Y) = \frac{\sum_{i=1}^{n}(X_i - \bar{X})(Y_i - \bar{Y})}{\sqrt{\sum_{i=1}^{n}(X_i - \bar{X})^2}\,\sqrt{\sum_{i=1}^{n}(Y_i - \bar{Y})^2}} \quad (7)$$

with $X_i$ and $Y_i$ being voxel $i$ of the two volumes, $n$ the number of voxels, and $\bar{X}$ and $\bar{Y}$ the average values of the volumes.
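
For completeness, the real-space PCC of Eq. (7) between two aligned volumes of equal shape is simply:

```python
import numpy as np

def volume_pcc(vol_x, vol_y):
    """Pearson correlation coefficient (Eq. 7) between two aligned volumes."""
    x = vol_x.ravel() - vol_x.mean()
    y = vol_y.ravel() - vol_y.mean()
    return float((x * y).sum() / np.sqrt((x ** 2).sum() * (y ** 2).sum()))
```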

The FSC quantifies the correlation between two signals at different spatial frequencies. For each frequency $k$, a value between -1 and 1 (with higher values indicating greater similarity) is computed by comparing the concentric shells of the Fourier transforms of the two volumes corresponding to $k$:

$$\mathrm{FSC}_k(X,Y) = \frac{\sum_{\mathbf{r}\in \mathrm{shell}(k)} \hat{X}(\mathbf{r}) \cdot \hat{Y}^{*}(\mathbf{r})}{\sqrt{\left(\sum_{\mathbf{r}\in \mathrm{shell}(k)} |\hat{X}(\mathbf{r})|^2\right)\cdot\left(\sum_{\mathbf{r}\in \mathrm{shell}(k)} |\hat{Y}(\mathbf{r})|^2\right)}} \quad (8)$$

where $\hat{X}(\mathbf{r})$ and $\hat{Y}(\mathbf{r})$ represent the Fourier transforms of the two volumes at frequency $\mathbf{r}$, $\mathrm{shell}(k)$ is the shell of frequency $k$, and $\hat{Y}^{*}(\mathbf{r})$ is the complex conjugate of $\hat{Y}(\mathbf{r})$.

To summarize the FSC curves into a single number, the FSC Resolution is computed by selecting a threshold $t$ and identifying the highest frequency $k$ such that $\mathrm{FSC}_k(X,Y) \geq t$. As thresholds we employ the commonly used values of 0.5 ($\mathrm{FSCR_{0.5}}$) and 0.143 ($\mathrm{FSCR_{0.143}}$), which correspond to the highest frequency at which the two maps agree with an SNR of 1 and 0.5, respectively [44].
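
A minimal sketch of the FSC computation (Eq. 8) and the threshold-based FSC resolution follows. It assumes cubic volumes, uses shells one Fourier pixel wide, and needs the sampling rate (Å/pixel) to convert the crossing frequency into a resolution in Å; production implementations typically differ in shell binning and interpolation of the crossing point.

```python
import numpy as np

def fsc_curve(vol_x, vol_y):
    """Fourier Shell Correlation (Eq. 8), one value per integer-radius shell.

    Assumes two cubic volumes of the same shape.
    """
    fx, fy = np.fft.fftn(vol_x), np.fft.fftn(vol_y)
    size = vol_x.shape[0]
    freqs = np.fft.fftfreq(size)
    kx, ky, kz = np.meshgrid(freqs, freqs, freqs, indexing='ij')
    radius = np.sqrt(kx ** 2 + ky ** 2 + kz ** 2)
    shells = np.round(radius * size).astype(int)  # shell index in Fourier pixels
    n_shells = size // 2
    fsc = np.zeros(n_shells)
    for k in range(n_shells):
        mask = shells == k
        num = (fx[mask] * np.conj(fy[mask])).sum()
        den = np.sqrt((np.abs(fx[mask]) ** 2).sum() * (np.abs(fy[mask]) ** 2).sum())
        fsc[k] = (num / den).real
    return fsc

def fsc_resolution(fsc, sampling_angstrom, threshold=0.143):
    """Resolution (Å) at which the FSC curve first drops below `threshold`."""
    box = 2 * len(fsc)  # approximate box size in pixels
    for k in range(1, len(fsc)):
        if fsc[k] < threshold:
            return box * sampling_angstrom / k  # 1 / spatial frequency
    return 2 * sampling_angstrom  # never crossed: report the Nyquist limit
```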

To decouple model performance from the intrinsic quality of each benchmark entry, we report the differences of the metrics with respect to the ground truth levels, estimated from the half-maps of the ground truth,

$$\Delta\mathrm{PCC} = \mathrm{PCC}(GT_0, GT_1) - \mathrm{PCC}(GT, V) \quad (9)$$

and

$$\Delta\mathrm{FSC} = \mathrm{FSC}(GT_0, GT_1) - \mathrm{FSC}(GT, V) \quad (10)$$

where $GT$ is the ground truth map, $GT_i$ is the ground truth map reconstructed with the $i$-th half of the data, and $V$ is the 3D volume reconstructed with the predicted poses (see Figure 1).

2.4 Training

Each benchmark entry was trained independently with the same hyperparameters (see Appendix B). Due to the uncertainty in the estimated orientations, we employed a weighted cross-entropy loss using the pose reliability estimate of each particle as the per-image weight

$$L = \frac{1}{N}\sum_{i=1}^{N}\sum_{c=1}^{C} -\mathrm{conf}_i \cdot P(R_{c,i}) \log\left(Q(R_{c,i})\right) \quad (11)$$

where $Q(R_{c,i})$ is the predicted probability for the rotation matrix with grid index $c$ and $P(R_{c,i})$ is $1/|G|$ when any of the ground truth matrices corresponds to grid index $c$ and zero otherwise.
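
A sketch of the confidence-weighted cross-entropy of Eq. (11) in PyTorch, assuming the model outputs log-probabilities over the SO(3) grid (e.g. via log_softmax) and the targets are the (possibly symmetry-expanded) probability vectors described in Section 2.1:

```python
import torch

def weighted_cross_entropy(log_probs, target_probs, confidences):
    """Confidence-weighted cross-entropy over the SO(3) grid (Eq. 11).

    log_probs    : (N, K) log of the predicted distribution Q over grid rotations.
    target_probs : (N, K) ground truth distribution P (1/|G| on the |G|
                   symmetry-equivalent grid indices, 0 elsewhere).
    confidences  : (N,) per-particle pose reliability conf_i.
    """
    per_particle = -(target_probs * log_probs).sum(dim=1)  # cross-entropy per image
    return (confidences * per_particle).mean()
```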

2.5 Evaluation protocol

[Figure 1: overview of the evaluation protocol]

Due to the uncertainty in the ground truth labels and the fact that what matters to Cryo-EM practitioners is the quality of the reconstructed volume, we devised an evaluation protocol inspired by the Cryo-EM gold standard [20], which is a per-entry 2-fold cross-validation strategy in which the poses of each half of the data are independently estimated and used to reconstruct two volumes (half-maps). For benchmarking supervised methods, it involves training an independent model for each half of the dataset to infer the poses of the other half of the dataset. After that, the final 3D volume is computed by reconstructing the two half-maps and averaging them. The final averaged map can then be compared with the ground truth map obtained from the original poses (see Figure 1). It is important to note that the FSC resolution values derived from this comparison are analogous to map-to-model FSC resolution estimations and not equivalent to the gold standard half-to-half resolution.
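
The protocol can be summarised with the following high-level sketch, where train_model, predict_poses and reconstruct are user-supplied callables (e.g. wrapping a PyTorch training loop and a reconstruction backend such as relion_reconstruct); they are placeholders, not part of the cesped API.

```python
def evaluate_entry(half0, half1, train_model, predict_poses, reconstruct):
    """Per-entry 2-fold protocol: train on one half, infer poses for the other."""
    model0 = train_model(half0)            # sees only half 0
    model1 = train_model(half1)            # sees only half 1

    poses_for_half1 = predict_poses(model0, half1)   # infer the *other* half
    poses_for_half0 = predict_poses(model1, half0)

    half_map0 = reconstruct(half0, poses_for_half0)
    half_map1 = reconstruct(half1, poses_for_half1)
    return 0.5 * (half_map0 + half_map1)   # final map, compared to the ground truth map
```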

Since Image2Sphere predicts only rotation matrices but not image shifts, we employed the ground truth translations when reconstructing the volumes. This could result in an overoptimistic estimation of performance; however, since the effect of the translations is tightly coupled with the accuracy of the angular estimation, this overestimation should be small. We leave the full inference of both rotations and translations for future work. Finally, to avoid overfitting to the validation set, we performed hyperparameter tuning on only one half of the data, using $\mathrm{wMAnE}$ as the metric.

3 Results and discussion

3.1 Benchmark, ParticlesDataset class and evaluation tool

Our benchmark consists of a diverse set of eight macromolecules, with an average of 300K particles per entry, including soluble and membrane macromolecules, symmetric and asymmetric complexes, and resolutions ranging from 5 Å to 3.2 Å (see Appendix A). For each particle in the dataset, we provide its image and estimated pose together with an estimate of the reliability of the pose. The benchmark can be automatically downloaded from Zenodo [53] using our cesped Python package.

The package includes a ParticlesDataset class, which implements the PyTorch Dataset API for seamless integration. It also offers optional yet recommended preprocessing steps commonly adopted in Cryo-EM (e.g., image normalisation, phase flipping), and specialised data augmentation techniques, like affine transformations that adjust both the image and its corresponding pose (see Appendix B). While the cesped package was designed with PyTorch in mind, the benchmark is accessible to a broader audience, as the data is stored in standard formats and accompanied by utility programs to assist users of other frameworks in adopting the CESPED benchmark.

Additionally, the package offers an automatic evaluation pipeline that only requires the predicted poses as input (grey box in Figure 1). For ease of use, a Singularity (https://zenodo.org/records/4667718) image definition file is included, eliminating the need to install Cryo-EM-specific software like Relion. This design enables those without Cryo-EM experience to utilize the cesped benchmark and package as effortlessly as they would standard datasets such as MNIST. Usage examples can be found in Appendix C.
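
The sketch below illustrates how such a dataset could be consumed from PyTorch. It is hypothetical: the import path and constructor arguments (entry name, half-set index) are assumptions, and Appendix C / the cesped documentation should be consulted for the actual API.

```python
# Hypothetical usage sketch; import path and constructor arguments are assumed.
import torch
from cesped.particlesDataset import ParticlesDataset  # assumed import path

dataset = ParticlesDataset(targetName="TEST", halfset=0)   # one half of one entry (assumed args)
loader = torch.utils.data.DataLoader(dataset, batch_size=32, shuffle=True)

for batch in loader:
    # Each item is expected to provide the particle image, its pose label and
    # the per-particle confidence described in Section 3.1.
    pass
```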

3.2 Performance of the baseline model on the benchmark

Table 1 summarises the results of the Image2Sphere [25] model on our benchmark, with per-entry results in Appendix D. While the mean $\mathrm{wMAnE}$ is ∼24°, for the best cases the error is as small as 9°. The $\Delta\mathrm{PCC}$ for the worst cases is > 0.1, highlighting that, for some entries, the reconstructed volumes are far from the ground truth solution. For a few cases, the results are much better, with $\Delta\mathrm{PCC}$ < 0.03. In terms of prediction vs ground truth $\mathrm{FSCR_{0.5}}$, most maps are in the 8-6 Å range, with a $\Delta\mathrm{FSCR_{0.5}}$ of 3.6 Å. However, the $\mathrm{FSCR_{0.143}}$ values, between 4-5 Å, indicate better correlation at lower signal levels. This visually translates into a relatively well-resolved central part of the map that becomes blurrier away from the centre (see Figure 2 and Appendix E). For the top-performing cases, a simple and fast local refinement of the predicted poses is sufficient to obtain high-resolution reconstructions comparable to the ground truth volumes, at a computational cost threefold lower than global refinement (Appendix F). Since Image2Sphere inference takes only minutes, far less than the hours needed for traditional refinement, further improvements could reduce computing times by at least one order of magnitude if local refinement were no longer needed (see Appendix G for running times).

Table 1: Image2Sphere performance on the CESPED benchmark, reported as mean (std) across entries.

             MAnE (°)      wMAnE (°)     ΔPCC            ΔFSCR_0.5 (Å)   ΔFSCR_0.143 (Å)
mean (std)   28.7 (12.7)   23.8 (12.2)   0.059 (0.033)   3.4 (0.6)       1.3 (0.7)

Given the inherent difficulties of Cryo-EM data, the fact that a generic pose estimation model can produce meaningful results in some examples without major modifications suggests that equivariant architectures can be useful for the Cryo-EM data domain.

[Figure 2: example reconstructions from predicted poses]

3.3 Example of model generalisability across samples

One of the main potential applications of supervised pose estimation models is to infer poses on similar, yet different, projects. In this section, we illustrate this use case by using an Image2Sphere model trained on the EMPIAR-10280 dataset to predict poses for the same protein under different experimental conditions (the EMPIAR-10278 dataset).

Figure 3 showcases three reconstructed volumes: (1) EMPIAR-10278 using ground truth poses (grey); (2) EMPIAR-10278 with poses predicted by the model trained on EMPIAR-10280 (yellow), illustrating the model’s generalisability; and (3) EMPIAR-10280 using poses inferred by the model trained on its own dataset, serving as a control for model performance. As expected, the EMPIAR-10278 map reconstructed with original poses shows superior quality compared to the others. Similarly, the EMPIAR-10280 map generated from the model trained on EMPIAR-10280 exhibits better quality than the EMPIAR-10278 map inferred using the EMPIAR-10280 model, reflecting the differences between the two datasets despite containing the same protein. Independently of these quality differences, the model’s capacity for generalization across datasets is evident through visual inspection of the EMPIAR-10278 inferred map (yellow), as the overall shape of the protein and several key secondary structure elements are clearly recognizable. This suggests that further improvements in the model could lead to the desired goal of training the model once and then inferring the poses of similar datasets at much faster speeds.

[Figure 3: reconstructed volumes for the cross-dataset generalisation experiment]

3.4 Challenges and future directions

Cryo-EM particle images are fundamentally different from the kinds of images encountered in other fields. One of the most critical challenges is their poor SNR, which can be as low as 0.01 [3]. While some methods have tried to mitigate this issue by applying filtering techniques [26] or using CNNs with larger kernel sizes [4, 21, 45], these solutions are not entirely effective.

Symmetry presents another complex facet of Cryo-EM data. Exploiting symmetry can drastically reduce the computational requirements for pose estimation, but it can also prevent simple models from learning. The unique combination of rotationally equivariant convolutions with the probabilistic estimation of poses makes the Image2Sphere model an ideal candidate to exploit this feature. However, the hybrid $S^2$ / SO(3) formalism means that the separation of rotational degrees of freedom from translational in-plane shifts is not easily achieved within this framework. A significant area for future work lies in leveraging rotational equivariance and translational equivariance for the joint estimation of the rotational and translational components of the poses (e.g., SE(3) equivariance).

In this work we have considered only the case of homogeneous refinement, which assumes that all particles are projections of a single macromolecule in a unique conformation. However, this is not always the case, and our benchmark could potentially be extended to deal with such examples. Models would then need to perform conformation classification alongside pose estimation.

4 Conclusions

Pose estimation is one of the most critical steps of the Cryo-EM processing pipeline, and while current algorithms are relatively robust and reliable, they are also computationally slow. Deep learning holds the promise of overcoming these challenges, but achieving this potential hinges on improvements in accuracy and reliability, for which systematic benchmarking is required. In this study, we introduce a benchmark specifically designed for Supervised Pose Inference of Cryo-EM particles, along with a suite of code utilities to assist machine learning practitioners unfamiliar with Cryo-EM. We also present a real-world image pose prediction model applied to our benchmark, demonstrating promising preliminary results on a subset of the data. This preliminary success suggests that addressing Cryo-EM-specific challenges, such as high noise levels and label inaccuracies, could lead to even better performance. The improvements in models for this benchmark will not only pave the way for more effective Supervised Pose Prediction models, but are also likely to give rise to innovative approaches to closely related challenges like Unsupervised Pose Estimation and Heterogeneity Analysis. Ultimately, those advancements could serve as a catalyst for even further developments, leading to a new paradigm in Cryo-EM image processing.

5 Availability

CESPED dataset and code can be found at https://github.com/oxpig/cesped.

6 Acknowledgments

Ruben Sanchez-Garcia is funded by an Astex Pharmaceuticals Sustaining Innovation Post-Doctoral Award. Javier Vargas is financially supported by the Spanish Ministerio de Ciencia e Innovación, Grant PID2022-137548OB-I00 funded by MCIN/AEI/10.13039/501100011033/.

References

  • Banjac etal. [2021]Jelena Banjac, Laurène Donati, and Michaël Defferrard.Learning to recover orientations from projections in single-particle cryo-EM.arXiv, apr 2021.doi: 10.48550/arxiv.2104.06237.URL http://arxiv.org/abs/2104.06237.
  • Bay etal. [2008]Herbert Bay, Andreas Ess, Tinne Tuytelaars, and Luc Van Gool.Speeded-Up Robust Features (SURF).Computer Vision and Image Understanding, 110(3):346–359, jun 2008.ISSN 1077-3142.doi: 10.1016/J.CVIU.2007.09.014.
  • Bendory etal. [2020]Tamir Bendory, Alberto Bartesaghi, and Amit Singer.Single-particle cryo-electron microscopy: Mathematical theory, computational challenges, and opportunities.IEEE signal processing magazine, 37(2):58, mar 2020.ISSN 15580792.doi: 10.1109/MSP.2019.2957822.URL https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7213211/.
  • Bepler etal. [2019]Tristan Bepler, Andrew Morin, Micah Rapp, Julia Brasch, Lawrence Shapiro, AlexJ. Noble, and Bonnie Berger.Positive-unlabeled convolutional neural networks for particle picking in cryo-electron micrographs.Nature Methods, 16(11):1153–1160, 2019.doi: 10.1038/s41592-019-0575-8.URL https://www.nature.com/articles/s41592-019-0575-8.
  • Besl and McKay [1992] Paul J. Besl and Neil D. McKay. Method for registration of 3-D shapes. In Sensor Fusion IV: Control Paradigms and Data Structures, volume 1611, pages 586–606. SPIE, apr 1992. doi: 10.1117/12.57955. URL https://www.spiedigitallibrary.org/conference-proceedings-of-spie/1611/0000/Method-for-registration-of-3-D-shapes/10.1117/12.57955.full.
  • Callaway [2020]Ewen Callaway.Revolutionary cryo-EM is taking over structural biology.Nature, 578(7794):201, feb 2020.ISSN 14764687.doi: 10.1038/D41586-020-00341-9.
  • De la Rosa-Trevín etal. [2013]J.M. De la Rosa-Trevín, J.Otón, R.Marabini, A.Zaldívar, J.Vargas, J.M. Carazo, and C.O.S. Sorzano.Xmipp 3.0: An improved software suite for image processing in electron microscopy.Journal of Structural Biology, 184(2):321–328, nov 2013.ISSN 10478477.doi: 10.1016/j.jsb.2013.09.015.URL http://www.ncbi.nlm.nih.gov/pubmed/24075951.
  • Deng etal. [2022]Haowen Deng, Mai Bui, Nassir Navab, Leonidas Guibas, Slobodan Ilic, and Tolga Birdal.Deep Bingham Networks: Dealing with Uncertainty and Ambiguity in Pose Estimation.International Journal of Computer Vision, 130(7):1627–1654, jul 2022.ISSN 15731405.doi: 10.1007/S11263-022-01612-W/METRICS.URL https://link.springer.com/article/10.1007/s11263-022-01612-w.
  • Deng etal. [2009]Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei.ImageNet: A large-scale hierarchical image database.In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pages 248–255. IEEE, jun 2009.ISBN 978-1-4244-3992-8.doi: 10.1109/CVPR.2009.5206848.URL https://ieeexplore.ieee.org/document/5206848/.
  • Dhakal et al. [2023] Ashwin Dhakal, Rajan Gyawali, Liguo Wang, and Jianlin Cheng. CryoPPP: A Large Expert-Labelled Cryo-EM Image Dataset for Machine Learning Protein Particle Picking. bioRxiv, page 2023.02.21.529443, feb 2023. doi: 10.1101/2023.02.21.529443. URL https://www.biorxiv.org/content/10.1101/2023.02.21.529443v1.
  • Donnat etal. [2022]Claire Donnat, Axel Levy, Frédéric Poitevin, EllenD. Zhong, and Nina Miolane.Deep generative modeling for volume reconstruction in cryo-electron microscopy.Journal of Structural Biology, 214(4):107920, dec 2022.ISSN 1047-8477.doi: 10.1016/J.JSB.2022.107920.
  • Egelman [2016]EdwardH. Egelman.The Current Revolution in Cryo-EM.Biophysical Journal, 110(5):1008–1012, mar 2016.ISSN 0006-3495.doi: 10.1016/J.BPJ.2016.02.001.URL https://www.sciencedirect.com/science/article/pii/S0006349516001429?via%3Dihub.
  • Giri et al. [2024] Nabin Giri, Liguo Wang, and Jianlin Cheng. Cryo2StructData: A Large Labeled Cryo-EM Density Map Dataset for AI-based Modeling of Protein Structures. bioRxiv, page 2023.06.14.545024, jan 2024. doi: 10.1101/2023.06.14.545024. URL https://www.biorxiv.org/content/10.1101/2023.06.14.545024v2.
  • Goodfellow etal. [2016]Ian Goodfellow, Yoshua Bengio, and Aaron Courville.Deep Learning.MIT Press, Cambridge, MA, USA, 2016.ISBN 978-0-262-03561-3.URL http://www.deeplearningbook.org.
  • Gorski etal. [2004]K.M. Gorski, E.Hivon, A.J. Banday, B.D. Wandelt, F.K. Hansen, M.Reinecke, and M.Bartelman.HEALPix – a Framework for High Resolution Discretization, and Fast Analysis of Data Distributed on the Sphere.The Astrophysical Journal, 622(2):759–771, sep 2004.doi: 10.1086/427976.URL http://arxiv.org/abs/astro-ph/0409513http://dx.doi.org/10.1086/427976.
  • Grant etal. [2018]Timothy Grant, Alexis Rohou, and Nikolaus Grigorieff.CisTEM, user-friendly software for single-particle image processing.eLife, 7, mar 2018.ISSN 2050084X.doi: 10.7554/eLife.35383.URL https://elifesciences.org/articles/35383.
  • Gyawali et al. [2023] Rajan Gyawali, Ashwin Dhakal, Liguo Wang, and Jianlin Cheng. CryoVirusDB: A Labeled Cryo-EM Image Dataset for AI-Driven Virus Particle Picking. bioRxiv, page 2023.12.25.573312, dec 2023. doi: 10.1101/2023.12.25.573312. URL https://www.biorxiv.org/content/10.1101/2023.12.25.573312v1.
  • Harauz and van Heel [1986]GHarauz and Marin van Heel.Exact filters for general geometry three dimensional reconstruction.Optik, 73, 1986.
  • He etal. [2016]Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun.Deep residual learning for image recognition.Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2016-Decem:770–778, dec 2016.ISSN 10636919.doi: 10.1109/CVPR.2016.90.
  • Henderson etal. [2012]Richard Henderson, Andrej Sali, MatthewL. Baker, Bridget Carragher, Batsal Devkota, KennethH. Downing, EdwardH. Egelman, Zukang Feng, Joachim Frank, Nikolaus Grigorieff, Wen Jiang, StevenJ. Ludtke, Ohad Medalia, PawelA. Penczek, PeterB. Rosenthal, MichaelG. Rossmann, MichaelF. Schmid, GunnarF. Schröder, AlasdairC. Steven, DavidL. Stokes, JohnD. Westbrook, Willy Wriggers, Huanwang Yang, Jasmine Young, HelenM. Berman, Wah Chiu, GerardJ. Kleywegt, and CatherineL. Lawson.Outcome of the first electron microscopy validation task force meeting.Structure, 20(2):205–214, feb 2012.ISSN 09692126.doi: 10.1016/j.str.2011.12.014.URL http://www.cell.com/article/S0969212612000147/fulltext.
  • Jiménez-Moreno etal. [2021]A.Jiménez-Moreno, D.Střelák, J.Filipovič, J.M. Carazo, and C.O.S. Sorzano.DeepAlign, a 3D alignment method based on regionalized deep learning for Cryo-EM.Journal of Structural Biology, 213(2):107712, jun 2021.ISSN 10958657.doi: 10.1016/j.jsb.2021.107712.URL https://www.sciencedirect.com/science/article/pii/S1047847721000174.
  • Kendall and Cipolla [2017]Alex Kendall and Roberto Cipolla.Geometric Loss Functions for Camera Pose Regression with Deep Learning.2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017-Janua:6555–6564, jul 2017.ISSN 1063-6919.doi: 10.1109/CVPR.2017.694.URL https://arxiv.org/abs/1704.00390.
  • Kendall etal. [2015]Alex Kendall, Matthew Grimes, and Roberto Cipolla.PoseNet: A convolutional network for real-time 6-dof camera relocalization.In Proceedings of the IEEE International Conference on Computer Vision, volume 2015 Inter, pages 2938–2946, may 2015.ISBN 9781467383912.doi: 10.1109/ICCV.2015.336.URL http://arxiv.org/abs/1505.07427.
  • Kimanius etal. [2021]Dari Kimanius, Liyi Dong, Grigory Sharov, Takanori Nakane, and SjorsH.W. Scheres.New tools for automated cryo-EM single-particle analysis in RELION-4.0.The Biochemical journal, 478(24):4169–4185, dec 2021.ISSN 1470-8728.doi: 10.1042/BCJ20210708.URL https://pubmed.ncbi.nlm.nih.gov/34783343/.
  • Klee etal. [2023]DavidM. Klee, Ondrej Biza, Robert Platt, and Robin Walters.Image to Sphere: Learning Equivariant Features for Efficient Pose Prediction.International Conference on Learning Representations, feb 2023.URL http://arxiv.org/abs/2302.13926.
  • Levy etal. [2022a]Axel Levy, Frédéric Poitevin, Julien Martel, Youssef Nashed, Ariana Peck, Nina Miolane, Daniel Ratner, Mike Dunne, and Gordon Wetzstein.CryoAI: Amortized Inference of Poses for Ab Initio Reconstruction of 3D Molecular Volumes from Real Cryo-EM Images.In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), volume 13681 LNCS, pages 540–557, mar 2022a.ISBN 9783031198021.doi: 10.1007/978-3-031-19803-8_32.URL https://arxiv.org/abs/2203.08138v4.
  • Levy etal. [2022b]Axel Levy, Gordon Wetzstein, Julien Martel, Frederic Poitevin, and EllenD. Zhong.Amortized Inference for Heterogeneous Reconstruction in Cryo-EM.Advances in neural information processing systems, 35:13038, dec 2022b.ISSN 10495258.URL https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10392957/.
  • Lian etal. [2022]Ruyi Lian, Bingyao Huang, Liguo Wang, Qun Liu, Yuewei Lin, and Haibin Ling.End-to-end orientation estimation from 2D cryo-EM images.Acta Crystallographica Section D: Structural Biology, 78(2):174–186, jan 2022.ISSN 20597983.doi: 10.1107/S2059798321011761.URL https://scripts.iucr.org/cgi-bin/paper?S2059798321011761.
  • Liu etal. [2019]Liyuan Liu, Haoming Jiang, Pengcheng He, Weizhu Chen, Xiaodong Liu, Jianfeng Gao, and Jiawei Han.On the Variance of the Adaptive Learning Rate and Beyond.8th International Conference on Learning Representations, ICLR 2020, aug 2019.URL https://arxiv.org/abs/1908.03265v4.
  • Lowe [2004]DavidG. Lowe.Distinctive image features from scale-invariant keypoints.International Journal of Computer Vision, 60(2):91–110, nov 2004.ISSN 09205691.doi: 10.1023/B:VISI.0000029664.99615.94/METRICS.URL https://link.springer.com/article/10.1023/B:VISI.0000029664.99615.94.
  • Mahendran etal. [2019]Siddharth Mahendran, Haider Ali, and Rene Vidal.A mixed classification-regression framework for 3D pose estimation from 2D images.In British Machine Vision Conference 2018, BMVC 2018, 2019.
  • Maluenda et al. [2019] D. Maluenda, T. Majtner, P. Horvath, J.L. Vilas, A. Jiménez-Moreno, J. Mota, E. Ramírez-Aportela, R. Sánchez-García, P. Conesa, L. Del Caño, Y. Rancel, Y. Fonseca, M. Martínez, G. Sharov, C.A. García, D. Strelak, R. Melero, R. Marabini, J.M. Carazo, and C.O.S. Sorzano. Flexible workflows for on-the-fly electron-microscopy single-particle image processing using Scipion. Acta Crystallographica Section D: Structural Biology, 75(Pt 10):882–894, oct 2019. ISSN 20597983. doi: 10.1107/S2059798319011860. URL http://www.ncbi.nlm.nih.gov/pubmed/31588920.
  • Melekhov etal. [2017]Iaroslav Melekhov, Juha Ylioinas, Juho Kannala, and Esa Rahtu.Image-Based Localization Using Hourglass Networks.Proceedings - 2017 IEEE International Conference on Computer Vision Workshops, ICCVW 2017, 2018-Janua:870–877, jul 2017.doi: 10.1109/ICCVW.2017.107.URL https://arxiv.org/abs/1703.07971.
  • Mohlin etal. [2020]David Mohlin, Gérald Bianchi Tobii Danderyd, and Josephine Sullivan.Probabilistic Orientation Estimation with Matrix Fisher Distributions.Advances in Neural Information Processing Systems, 33:4884–4893, 2020.URL https://proceedings.neurips.cc/paper/2020/hash/33cc2b872.
  • Murphy etal. [2021]Kieran Murphy, Carlos Esteves, Varun Jampani, Srikumar Ramalingam, and Ameesh Makadia.Implicit-PDF: Non-Parametric Representation of Probability Distributions on the Rotation Manifold.In Proceedings of Machine Learning Research, volume 139, pages 7882–7893. PMLR, jul 2021.ISBN 9781713845065.URL https://proceedings.mlr.press/v139/murphy21a.html.
  • Nogales [2015]Eva Nogales.The development of cryo-EM into a mainstream structural biology technique.Nature Methods, 13(1):24–27, dec 2015.ISSN 15487105.doi: 10.1038/nmeth.3694.URL https://www.nature.com/articles/nmeth.3694.pdf?origin=ppub.
  • Nogales and Scheres [2015]Eva Nogales and SjorsH.W. Scheres.Cryo-EM: A Unique Tool for the Visualization of Macromolecular Complexity, may 2015.ISSN 10974164.URL /pmc/articles/PMC4441764/?report=abstracthttps://www.ncbi.nlm.nih.gov/pmc/articles/PMC4441764/.
  • Passmore and Russo [2016]L.A. Passmore and C.J. Russo.Specimen Preparation for High-Resolution Cryo-EM.Methods in Enzymology, 579:51–86, jan 2016.ISSN 0076-6879.doi: 10.1016/BS.MIE.2016.04.011.
  • Paszke etal. [2019]Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Köpf, Edward Yang, Zach DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, LuFang, Junjie Bai, and Soumith Chintala.PyTorch: An imperative style, high-performance deep learning library.In Advances in Neural Information Processing Systems, volume32, 2019.doi: 10.5555/3454287.3455008.URL https://dl.acm.org/doi/10.5555/3454287.3455008.
  • Penczek [2010]PawelA. Penczek.Image restoration in cryo-electron microscopy.In Methods in Enzymology, volume 482, pages 35–72. NIH Public Access, 2010.doi: 10.1016/S0076-6879(10)82002-6.URL /pmc/articles/PMC3166661/.
  • Penczek etal. [1994]PawelA. Penczek, RobertA. Grassucci, and Joachim Frank.The ribosome at improved resolution: New techniques for merging and orientation refinement in 3D cryo-electron microscopy of biological particles.Ultramicroscopy, 53(3):251–270, mar 1994.ISSN 03043991.doi: 10.1016/0304-3991(94)90038-8.
  • Prokudin etal. [2018]Sergey Prokudin, Peter Gehler, and Sebastian Nowozin.Pose estimation with uncertainty quantification.In Proceedings of the European conference on computer vision, pages 534–551. Springer Verlag, may 2018.ISBN 9783030012397.doi: 10.1007/978-3-030-01240-3_33.URL https://arxiv.org/abs/1805.03430v1.
  • Punjani etal. [2017]Ali Punjani, JohnL. Rubinstein, DavidJ. Fleet, and MarcusA. Brubaker.cryoSPARC: algorithms for rapid unsupervised cryo-EM structure determination.Nature Methods 2017 14:3, 14(3):290–296, feb 2017.ISSN 1548-7105.doi: 10.1038/nmeth.4169.URL https://www.nature.com/articles/nmeth.4169.
  • Rosenthal and Henderson [2003]PeterB. Rosenthal and Richard Henderson.Optimal determination of particle orientation, absolute hand, and contrast loss in single-particle electron cryomicroscopy.Journal of Molecular Biology, 333(4):721–745, oct 2003.ISSN 00222836.doi: 10.1016/j.jmb.2003.07.013.
  • Sanchez-Garcia etal. [2018]Ruben Sanchez-Garcia, Joan Segura, David Maluenda, JoseMaria Carazo, and Carlos OscarSorzano Sorzano.Deep Consensus, a deep learning-based approach for particle pruning in cryo-electron microscopy Ruben.IUCrJ, 5(Pt 6):854–865, nov 2018.ISSN 20522525.doi: 10.1107/S2052252518014392.URL http://scripts.iucr.org/cgi-bin/paper?S2052252518014392.
  • Scheres [2012]SjorsH.W. Scheres.RELION: Implementation of a Bayesian approach to cryo-EM structure determination.Journal of Structural Biology, 180(3):519–530, dec 2012.ISSN 10478477.doi: 10.1016/j.jsb.2012.09.006.URL http://www.ncbi.nlm.nih.gov/pubmed/23000701.
  • Scheres and Chen [2012]SjorsH.W. Scheres and Shaoxia Chen.Prevention of overfitting in cryo-EM structure determination.Nature Methods 2012 9:9, 9(9):853–854, jul 2012.ISSN 1548-7105.doi: 10.1038/nmeth.2115.URL https://www.nature.com/articles/nmeth.2115.
  • Sorzano etal. [2022]C.O.S. Sorzano, A.Jimenez-Moreno, D.Maluenda, M.Martinez, E.Ramirez-Aportela, J.Krieger, R.Melero, A.Cuervo, J.Conesa, J.Filipovic, P.Conesa, L.Del Cano, Y.C. Fonseca, J.Jiménez-De La Morena, P.Losana, R.Sanchez-Garcia, D.Strelak, E.Fernandez-Gimenez, F.P. De Isidro-Gómez, D.Herreros, J.L. Vilas, R.Marabini, and J.M. Carazo.On bias, variance, overfitting, gold standard and consensus in single-particle analysis by cryo-electron microscopy.Acta Crystallographica Section D: Structural Biology, 78(4):410–423, apr 2022.ISSN 20597983.doi: 10.1107/S2059798322001978/IC5116SUP1.PDF.URL https://scripts.iucr.org/cgi-bin/paper?ic5116https://journals.iucr.org/d/issues/2022/04/00/ic5116/.
  • Tang etal. [2007]Guang Tang, Liwei Peng, PhilipR. Baldwin, DeepinderS. Mann, Wen Jiang, Ian Rees, and StevenJ. Ludtke.EMAN2: An extensible image processing suite for electron microscopy.Journal of Structural Biology, 157(1):38–46, jan 2007.ISSN 10478477.doi: 10.1016/j.jsb.2006.05.009.URL http://www.ncbi.nlm.nih.gov/pubmed/16859925.
  • Tegunov and Cramer [2019]Dimitry Tegunov and Patrick Cramer.Real-time cryo-electron microscopy data preprocessing with Warp.Nature Methods, 16(11):1146–1152, nov 2019.ISSN 15487105.doi: 10.1038/s41592-019-0580-y.
  • Xiang etal. [2014]YuXiang, Roozbeh Mottaghi, and Silvio Savarese.Beyond PASCAL: A benchmark for 3D object detection in the wild.2014 IEEE Winter Conference on Applications of Computer Vision, WACV 2014, pages 75–82, 2014.doi: 10.1109/WACV.2014.6836101.URL https://cvgl.stanford.edu/papers/xiang_wacv14.pdf.
  • Xiang etal. [2018]YuXiang, Tanner Schmidt, Venkatraman Narayanan, and Dieter Fox.PoseCNN: A Convolutional Neural Network for 6D Object Pose Estimation in Cluttered Scenes.In Robotics: Science and Systems, 2018.ISBN 9780992374747.doi: 10.15607/RSS.2018.XIV.019.URL https://arxiv.org/abs/1711.00199.
  • Zenodo [2013]Zenodo.Zenodo: Research. Shared., 2013.URL https://zenodo.org/.

Appendix A: Benchmark composition

EMPIAR ID | Composition | Symmetry | Image pixels | FSCR0.143 (Å) | Masked FSCR0.143 (Å) | # particles
10166 | Human 26S proteasome bound to the chemotherapeutic Oprozomib | C1 | 284 | 5.0 | 3.9 | 238631
10786 | Substance P-Neurokinin Receptor G protein complexes (SP-NK1R-miniGs399) | C1 | 184 | 3.3 | 3.0* | 288659
10280 | Calcium-bound TMEM16F in nanodisc with supplement of PIP2 | C2 | 182 | 3.6 | 3.0* | 459504
11120 | M22 bound TSHR Gs 7TM G protein | C1 | 232 | 3.4 | 3.0* | 244973
10409 | Replicating SARS-CoV-2 polymerase (Map 1) | C1 | 240 | 3.3 | 3.0* | 406001
10374 | Human ABCG2 transporter with inhibitor MZ29 and 5D3-Fab | C2 | 216 | 3.7 | 3.0* | 323681
10399 | Arabinofuranosyltransferase AftD from Mycobacteria | C1 | 184 | 3.2 | 3.1 | 490616
10648 | PKM2 in complex with Compound 5 | D2 | 222 | 3.7 | 3.3 | 234956
simulated 10648 | Same PKM2 dataset as in 10648, but with simulated images | D2 | 222 | 3.5 | 3.4 | 138848
consensus 10648 | Same PKM2 dataset as in 10648, but with consensus angles | D2 | 222 | 3.8 | 3.4 | 234956
  • * Nyquist Frequency at 1.5 Å/pixel; Resolution is estimated at the usual threshold 0.143.

  • Reported FSCR0.143 values were obtained directly from the relion_refine logs while Masked FSCR0.143 values were collected from the relion_postprocess logs.

Particle poses were estimated using the Relion version 4 auto-refine program [46, 24]. As a starting model, we used the map obtained with:

relion_reconstruct --pad 2.0 --ctf --i original_poses.star \
    --sym $SYMMETRY --o reconstructed_map.mrc

The mask was created using:

relion_mask_create --i reconstructed_map.mrc --o mask.mrc --lowpass 15.0 \
    --extend_inimask 3 --width_soft_edge 6 --ini_threshold $THRESHOLD

with $THRESHOLD manually selected for each entry.

The auto-refine command used was:

mpirun -np 5 relion_refine_mpi --i original_poses.star --particle_diameter $DIAMETER \
    --ctf --zero_mask --firstiter_cc --ini_high 40.0 \
    --sym $SYMMETRY --ref reconstructed_map.mrc --norm --scale \
    --solvent_mask mask.mrc --o outputdir/run --oversampling 1 --flatten_solvent \
    --solvent_correct_fsc --pad 2 --auto_local_healpix_order 4 --healpix_order 2 \
    --offset_range 5.0 --offset_step 2.0 --auto_refine --split_random_halves \
    --low_resol_join_halves 40 --dont_combine_weights_via_disc

The simulated dataset was generated with the following command:

relion_project --i original_poses.star --ang original_poses.star \
    --ang_simulate original_poses.star --o simulated_dir/simulated \
    --simulate --adjust_simulation_SNR 2.0 --ctf

The consensus dataset was generated using the compare angles protocol from Scipion Xmipp [48, 21], incorporating both our original Relion refinement output and a refinement performed with cisTEM [16]. An angular distance threshold of 5° was employed.
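For reference, the 5° criterion above is a geodesic distance on SO(3) between the two estimated orientations of each particle (for symmetric samples one would typically take the minimum over symmetry-related poses). The following minimal numpy sketch is our own illustration of how such an angular distance can be computed from two rotation matrices, not the Xmipp implementation:

import numpy as np

def angular_distance_deg(r1: np.ndarray, r2: np.ndarray) -> float:
    """Geodesic distance, in degrees, between two 3x3 rotation matrices."""
    # The relative rotation r1 @ r2.T rotates by exactly the angle separating both poses
    cos_angle = (np.trace(r1 @ r2.T) - 1.0) / 2.0
    cos_angle = np.clip(cos_angle, -1.0, 1.0)  # guard against numerical round-off
    return float(np.degrees(np.arccos(cos_angle)))

# Example: two poses differing by a 5 degree in-plane rotation
theta = np.radians(5.0)
r_z = np.array([[np.cos(theta), -np.sin(theta), 0.0],
                [np.sin(theta),  np.cos(theta), 0.0],
                [0.0,            0.0,           1.0]])
print(angular_distance_deg(np.eye(3), r_z))  # ~5.0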

Appendix B: Image2Sphere and training hyperparameters

Our Image2Sphere model follows the implementation of Klee et al. [25] with the following configuration:

  • Feature extractor: ResNet152 [19] with default parameters as implemented in torchvision, using ImageNet weights. The input images are resized to 256 pixels before being fed to the network, giving a feature map of shape 2048x8x8. Since the input images contain only one channel but the ResNet expects three, two additional channels were added by applying Gaussian filters with sigma 1 and 2 to the input image (see the sketch after this list).

  • Image projector to S2: Default orthographic projector with HEALPix [15] grid order 3 (~7.5°), where only 50% of the grid points are sampled. The feature map is projected from 2048 channels to 512 using a 1x1 Conv2d and then converted to spherical harmonics with lmax = 8.

  • S2 convolution: 512 filters with global support on a HEALPix grid of order 3.

  • SO(3) convolution: 16 filters with local support (max_beta = π/8, max_gamma = 2π, n_alpha = 8, n_beta = 3).

  • Probability distribution discretization: HEALPix grid of order 4 (~3.7°).
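The sketch below illustrates the single-channel to three-channel adaptation described for the feature extractor above. The Gaussian kernel sizes are our assumption; the original code may differ in these details:

import torch
import torchvision.transforms.functional as TF

def to_three_channels(img: torch.Tensor) -> torch.Tensor:
    """Expand a Bx1xNxN particle batch to Bx3xNxN for a standard ResNet.

    Channels: raw image, Gaussian-blurred with sigma=1, Gaussian-blurred with sigma=2.
    """
    blur1 = TF.gaussian_blur(img, kernel_size=[5, 5], sigma=[1.0, 1.0])
    blur2 = TF.gaussian_blur(img, kernel_size=[9, 9], sigma=[2.0, 2.0])
    return torch.cat([img, blur1, blur2], dim=1)

print(to_three_channels(torch.randn(4, 1, 256, 256)).shape)  # torch.Size([4, 3, 256, 256])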

Training was conducted using RAdam [29] as the optimizer, with an initial learning rate of 1e-3 and a weight decay of 1e-5. The learning rate was halved each time the validation loss stagnated for 10 epochs. Training was stopped when the number of epochs reached 400 or when the validation loss did not improve for 12 epochs.
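A minimal PyTorch sketch of this optimisation setup is given below. The network and the per-epoch training function are placeholders, and the use of ReduceLROnPlateau is our reading of the schedule described above rather than a verbatim copy of the training code:

import torch

model = torch.nn.Linear(256 * 256, 10)  # placeholder for the Image2Sphere network

def train_one_epoch_and_validate() -> float:
    """Placeholder: train for one epoch and return the validation loss."""
    return torch.rand(1).item()

optimizer = torch.optim.RAdam(model.parameters(), lr=1e-3, weight_decay=1e-5)
# Halve the learning rate when the validation loss has stagnated for 10 epochs
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, mode="min",
                                                       factor=0.5, patience=10)

best_val, epochs_without_improvement = float("inf"), 0
for epoch in range(400):
    val_loss = train_one_epoch_and_validate()
    scheduler.step(val_loss)
    if val_loss < best_val:
        best_val, epochs_without_improvement = val_loss, 0
    else:
        epochs_without_improvement += 1
    if epochs_without_improvement >= 12:  # early stopping
        break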

Data augmentation was conducted with the following composed transformations:

  • Random shift from -5% to 5% with probability 0.5.

  • Random rotation from -20° to 20° with probability 0.5.

  • Random 90° rotation with probability 1.

  • Uniform noise addition with a random scale from 0 to 2 with probability 0.2.

  • Gaussian noise addition with a random standard deviation from 0 to 0.5 with probability 0.2.

  • Random zoom-in of size 0% to 5% with probability 0.2.

  • Random erasing of patches of size 0% to 2% with probability 0.1.

Notice that rotation transformations require corresponding adjustments of the ground-truth labels, as illustrated in the sketch below.
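A minimal sketch of such a label adjustment for an in-plane rotation follows. The composition order and sign convention are illustrative only and depend on the projection convention, so they should be checked against the cesped code; image shifts would need an analogous 2D rotation.

import numpy as np

def rotate_pose_label(rot_mat: np.ndarray, in_plane_deg: float) -> np.ndarray:
    """Compose a ground-truth pose (3x3) with the in-plane rotation applied to the image.

    Whether the in-plane rotation is pre- or post-multiplied, and its sign, depends on
    the projection convention used by the dataset; this is only an illustration.
    """
    t = np.radians(in_plane_deg)
    r_z = np.array([[np.cos(t), -np.sin(t), 0.0],
                    [np.sin(t),  np.cos(t), 0.0],
                    [0.0,        0.0,       1.0]])
    return r_z @ rot_mat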

Appendix C: cesped package usage example

Dataset instantiation only requires providing the name of the target (a string like "10280") and the half-set number (0 or 1) (Listing 1). ParticlesDataset objects can be used directly as datasets in a PyTorch DataLoader.

import torch
from torch.utils.data import DataLoader

from cesped.particlesDataset import ParticlesDataset

listOfEntries = ParticlesDataset.getCESPEDEntries()
targetName, halfset = listOfEntries[0]  # We will work with the first example
dataset = ParticlesDataset(targetName, halfset)
dl = DataLoader(dataset, batch_size=32, num_workers=4)
for batch in dl:
    iid, img, (rotMat, xyShiftAngs, confidence), metadata = batch

    # iid is the id of the particle (a string)
    # img is a batch of Bx1xNxN images
    # rotMat is a batch of rotation matrices Bx3x3
    # xyShiftAngs is a batch of image shifts in Angstroms Bx2
    # confidence is a batch of numbers between 0 and 1, Bx1
    # metadata is a dict of names:values with particle information

    # model, loss_function and optimizer are assumed to be defined elsewhere
    predRot = model(img)
    loss = loss_function(predRot, rotMat)
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
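The model, loss_function and optimizer in Listing 1 are left abstract. The Image2Sphere baseline is trained on a discretised probability distribution over SO(3), but as a simple illustration of a pose loss that fits this training loop, a geodesic loss between predicted and ground-truth rotation matrices could be sketched as follows (our own example, not the benchmark's loss):

import torch

def geodesic_loss(pred_rotmats: torch.Tensor, gt_rotmats: torch.Tensor) -> torch.Tensor:
    """Mean geodesic angle, in radians, between predicted and ground-truth rotations (Bx3x3)."""
    rel = torch.matmul(pred_rotmats, gt_rotmats.transpose(-1, -2))
    trace = rel.diagonal(dim1=-2, dim2=-1).sum(-1)
    cos_angle = ((trace - 1.0) / 2.0).clamp(-1.0 + 1e-7, 1.0 - 1e-7)
    return torch.arccos(cos_angle).mean()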

ParticlesDataset objects can also be used to update the metadata with newly predicted poses and to save the results in Relion star format, commonly used in Cryo-EM software (Listing 2).

for iid, pred_rotmats, maxprob in predictions:
    # iid is the list of ids of the particles (string)
    # pred_rotmats is a batch of predicted rotation matrices Bx3x3
    # maxprob is a batch of numbers between 0 and 1, Bx1,
    # that indicates the confidence in the prediction (e.g., softmax values)
    n_preds = pred_rotmats.shape[0]
    dataset.updateMd(ids=iid, angles=pred_rotmats,
                     shifts=torch.zeros(n_preds, 2),
                     confidence=maxprob,
                     angles_format="rotmat")
dataset.saveMd("predictions.star")  # Save the dataset as a STAR file
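How pred_rotmats and maxprob are obtained depends on the model. For a model that, like Image2Sphere, outputs a distribution over a discrete grid of rotations, one straightforward recipe is to take the most probable grid point and its probability. In the sketch below, grid_rotmats is a hypothetical Kx3x3 tensor holding the rotation matrix of each grid point:

import torch

def logits_to_predictions(logits: torch.Tensor, grid_rotmats: torch.Tensor):
    """Convert per-particle logits over a rotation grid (BxK) into poses and confidences."""
    probs = torch.softmax(logits, dim=-1)  # BxK
    maxprob, idx = probs.max(dim=-1)       # B and B
    pred_rotmats = grid_rotmats[idx]       # Bx3x3
    return pred_rotmats, maxprob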

Once predictions have been computed for the two halves of a benchmark entry, the evaluation can be run automatically by providing the star files of both predictions to a command line tool (Listing 3) or to a function. While you can use your local installation of Relion, we also provide a Singularity definition file so that you do not need to install it manually. See the instructions at https://github.com/rsanchezgarc/cesped.

python -m cesped.evaluateEntry --predictionType SO3 --targetName 10280 \
    --half0PredsFname particles_preds_0.star \
    --half1PredsFname particles_preds_1.star \
    --n_cpus 12 --outdir evaluation/

from cesped.evaluateEntry import evaluate

evaluation_metrics = evaluate(targetName="10280",
                              half0PredsFname="particles_preds_0.star",
                              half1PredsFname="particles_preds_1.star",
                              predictionType="SO3",  # Literal["S2", "SO3", "SO3xR2"]
                              usePredConfidence=True,
                              n_cpus=4,
                              outdir="output/directory")

Appendix D: Image2Sphere per-entry results

This section contains per-entry statistics for the Image2Sphere model predictions using the evaluation protocol proposed in the main text. The last two rows correspond to different versions of the 10648 entry and are not included in Table 1. In addition to angular error measurements, the other metrics compare the ground truth map (GT) against the map reconstructed from the predicted poses (V), namely PCC(GT,V) and FSCRt(GT,V), where t denotes the threshold, 0.5 or 0.143, where reported. GT is obtained by running relion_reconstruct on the ground truth poses (which were estimated with relion_refine --auto_refine). The reconstructed map V is generated with relion_reconstruct from the predicted poses.

In addition, we also report half-to-half map metrics, which are commonly employed in traditional Cryo-EM algorithms and in unsupervised DL methods, and which can be used to compare them to supervised DL methods. In particular, we compute PCC(GT0,GT1), PCC(V0,V1), FSCRt(GT0,GT1) and FSCRt(V0,V1), where 0 and 1 denote the dataset half. Thus, V0 is the map reconstructed from the predicted poses of half dataset 0 using a model trained on half dataset 1, and GT0 is obtained as GT but using only the ground truth poses of half dataset 0.
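For clarity, PCC is the Pearson correlation between the voxel values of the two maps, and FSCRt is the resolution at which the Fourier Shell Correlation between them first drops below the threshold t. The numpy sketch below is a simplified illustration for cubic, unmasked maps; it is not the relion_postprocess implementation used to produce the numbers in this appendix:

import numpy as np

def pcc(vol1: np.ndarray, vol2: np.ndarray) -> float:
    """Pearson correlation coefficient between two maps."""
    return float(np.corrcoef(vol1.ravel(), vol2.ravel())[0, 1])

def fsc_resolution(vol1: np.ndarray, vol2: np.ndarray, voxel_size: float,
                   threshold: float = 0.143) -> float:
    """Resolution (Å) at which the Fourier Shell Correlation first drops below threshold."""
    n = vol1.shape[0]  # cubic maps assumed
    f1 = np.fft.fftshift(np.fft.fftn(vol1))
    f2 = np.fft.fftshift(np.fft.fftn(vol2))
    grid = np.indices(vol1.shape) - n // 2
    radii = np.sqrt((grid ** 2).sum(axis=0)).astype(int)
    for r in range(1, n // 2):
        shell = radii == r
        num = np.real(np.sum(f1[shell] * np.conj(f2[shell])))
        den = np.sqrt(np.sum(np.abs(f1[shell]) ** 2) * np.sum(np.abs(f2[shell]) ** 2))
        if num / den < threshold:
            return n * voxel_size / r  # shell r corresponds to frequency r / (n * voxel_size)
    return 2.0 * voxel_size  # never crossed the threshold: report the Nyquist resolution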

EMPIAR ID | MAnE (°) | wMAnE (°) | PCC(V0,V1) | PCC(GT,V) | FSCR0.143(V0,V1) (Å) | FSCR0.5(V0,V1) (Å) | FSCR0.143(GT,V) (Å) | FSCR0.5(GT,V) (Å) | FSCR0.143(GT0,GT1) (Å) | FSCR0.5(GT0,GT1) (Å) | PCC(GT0,GT1)
10166 | 15.7 | 9.1 | 0.986 | 0.974 | 5.1 | 6.8 | 6.2 | 8.1 | 4.4 | 4.8 | 0.992
10786 | 32.6 | 29.5 | 0.957 | 0.925 | 3.8 | 4.3 | 3.4 | 7.6 | 3.1 | 3.5 | 0.974
10280 | 17.8 | 14.9 | 0.981 | 0.957 | 3.9 | 4.4 | 4.3 | 7.0 | 3.4 | 3.8 | 0.991
11120 | 44.7 | 41.1 | 0.989 | 0.863 | 4.1 | 4.6 | 6.0 | 8.3 | 3.2 | 3.7 | 0.965
10409 | 45.3 | 39.2 | 0.960 | 0.884 | 3.5 | 4.0 | 4.0 | 8.3 | 3.0 | 3.3 | 0.988
10374 | 35.0 | 24.8 | 0.991 | 0.969 | 3.7 | 4.1 | 4.1 | 6.5 | 3.0 | 3.5 | 0.996
10399 | 25.5 | 21.6 | 0.992 | 0.917 | 3.7 | 4.1 | 4.0 | 6.1 | 3.1 | 3.4 | 0.996
10648 | 13.3 | 10.6 | 0.982 | 0.934 | 3.8 | 4.1 | 4.3 | 6.5 | 3.4 | 3.6 | 0.994
simulated 10648 | 6.0 | NA | 0.996 | 0.935 | 4.5 | 4.6 | 4.6 | 4.8 | 3.5 | 4.6 | 0.998
consensus 10648 | 8.3 | 8.1 | 0.971 | 0.893 | 3.8 | 4.1 | 4.2 | 6.8 | 3.4 | 3.6 | 0.986
  • MAnE: Mean Angular Error; wMAnE: weighted Mean Angular Error; PCC(V0,V1): reconstructed half-to-half Pearson's correlation coefficient; PCC(GT,V): reconstructed-to-ground-truth Pearson's correlation coefficient; FSCR0.143(V0,V1): reconstructed half-to-half FSC resolution at threshold 0.143, and FSCR0.5(V0,V1) at threshold 0.5; FSCR0.143(GT,V): reconstructed-to-ground-truth resolution at threshold 0.143, and FSCR0.5(GT,V) at threshold 0.5; FSCR0.143(GT0,GT1): ground truth half-to-half FSC resolution at threshold 0.143, and FSCR0.5(GT0,GT1) at threshold 0.5; PCC(GT0,GT1): ground truth half-to-half Pearson's correlation coefficient.

  • All reported resolutions were obtained using manually computed masks that are available at https://zenodo.org/record/8392782.

Appendix E: Reconstructed volumes

This appendix shows the volumes reconstructed for some of the best performing examples of the Image2Sphere model on our benchmark. In all cases, the quality of the central region of the protein is quite close to that of the ground truth, whereas the density at the edges of the macromolecule is considerably worse. This is in line with what would be expected if there were some degree of inaccuracy in the angular estimation, as the magnitude of the errors in the volume is proportional to both the angular error and the radius of the macromolecule.
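To make this proportionality concrete, a small angular error Δθ (in radians) moves a point at distance r from the centre of rotation by roughly its arc length; the numbers below are illustrative rather than measured from our maps:

d ≈ r · Δθ, e.g. Δθ = 5° ≈ 0.087 rad gives d ≈ 8.7 Å at r = 100 Å, but only d ≈ 1.7 Å at r = 20 Å,

so the periphery of the map is blurred considerably more than the core.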

[Figures: reconstructed volumes for selected benchmark entries compared with their ground truth maps]

Appendix F: Locally refined solution

In this section, we illustrate the usefulness of our approach by showing the effect of classical local refinement on the Image2Sphere results for the benchmark entry 10374. In this case, the Image2Sphere model predicted poses with a wMAnE of 24.8° that led to a reconstructed map with an FSCR0.143(V0,V1) of 3.7 Å. When the predicted poses are used as priors for a local refinement in Relion with --sigma_angle 2.0, the refined map achieves a PCC(GT,V) of 0.997, compared to the original 0.969, showing that the refined map is much more similar to the ground truth map. Indeed, as can be seen in Supplementary Figure 3, after the local refinement not only is the quality of the core of the protein comparable to that of the ground truth, but the quality of the peripheral parts of the map is also much better, almost as good as in the ground truth. Equally important, since the angular search is limited to the neighbourhood of the predicted poses (±6°), the number of image comparisons carried out by Relion is much smaller, resulting in a three-fold speed-up in computational time, even when including the time required for pose inference with the Image2Sphere model.

[Supplementary Figure 3: locally refined map for entry 10374 compared with the Image2Sphere reconstruction and the ground truth]

Appendix G: Running times

The following table lists the running times of Relion auto-refine and of Image2Sphere inference on the same hardware configuration (4 Nvidia A100 cards and 32 CPU cores).

EMPIAR ID | Relion (min) | Image2Sphere (min)
10166 | 521 | 22
10786 | 227 | 12
10280 | 192 | 8
11120 | 102 | 3
10648 | 91 | 5
10409 | 190 | 5
10374 | 133 | 4

Appendix H: Impact of Particle Misalignment on Map Quality Estimation

In order to study the sensitivity to angular inaccuracy of the map quality estimations used in this work, namely the FSC resolution and the Pearson's correlation coefficient (PCC), we ran two experiments. In the first experiment, we added uniform random noise within the ranges of ±1°, ±3°, and ±5° to the Euler angles of each particle (Supplementary Table 4 and Supplementary Figures 4 and 5, right column). In the second experiment, we randomised the Euler angles of 10%, 20%, and 30% of the particles for each entry in the benchmark (Supplementary Table 5 and Supplementary Figures 4 and 5, left column). In the absence of symmetry, the expected angular error (geodesic distance) for randomised angles is approximately 126.9°, whereas for the uniform random noise the expected angular errors are 1.0°, 2.9°, and 4.8°, respectively (as estimated through simulation).
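The expected error for fully randomised angles can be checked with a quick simulation. The sketch below uses scipy's uniform sampling over SO(3) as a stand-in for the randomisation procedure (symmetry, which we ignore here, would reduce the value):

import numpy as np
from scipy.spatial.transform import Rotation

# The geodesic error between a fixed pose and a uniformly random pose follows the same
# distribution as the rotation angle of a single uniform random rotation.
angles = np.degrees(Rotation.random(200_000, random_state=0).magnitude())
print(round(float(angles.mean()), 1))  # ~126-127 degrees, close to the value quoted above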

Supplementary Table 4 and the right column of Supplementary Figures 4 and 5 illustrate a clear trend in which increasing angular errors lead to a reduction in the FSC resolution and the PCC. Since in this experiment we corrupted the alignment of all particles, this underscores that global map quality measurements are effective proxies for estimating the overall mean angular accuracy.

Supplementary Table 5 and the left column of Supplementary Figures 4 and 5 show that, as the fraction of misaligned particles increases, both the resolution and the correlation of the maps decrease. While the effect of this type of corruption is smaller than when the angles of all particles are perturbed, it remains noticeable. In most cases, the FSC resolution at threshold 0.5 is clearly different even when as few as 10% of the particles are perturbed. Given that the number of misaligned particles in maps refined with methods such as Relion can be quite large, with some cases reporting misalignment levels of up to 60%, the sensitivity of the FSC resolution should be sufficient to compare the accuracy of different methods. A similar trend is observed in the PCC values, which steadily decline as the fraction of misaligned particles increases.

These two experiments confirm that it is possible to distinguish between different levels of alignment corruption using map quality measurements; hence, they serve as sensible proxies for assessing angular alignment accuracy. However, these measurements are not directly comparable across different samples; thus, comparisons are only valid when examining different alignment results for the same sample, as we do in this benchmark.

Supplementary Table 4. PCC and masked PCC for increasing levels of uniform angular noise added to all particles (0°, ±1°, ±3°, ±5°).

Entry | PCC (0°) | PCC (±1°) | PCC (±3°) | PCC (±5°) | Masked PCC (0°) | Masked PCC (±1°) | Masked PCC (±3°) | Masked PCC (±5°)
10166 | 0.981 | 0.950 | 0.921 | 0.894 | 0.995 | 0.989 | 0.968 | 0.941
10280 | 0.978 | 0.946 | 0.911 | 0.887 | 0.995 | 0.991 | 0.972 | 0.949
10374 | 0.980 | 0.950 | 0.925 | 0.908 | 0.998 | 0.996 | 0.980 | 0.962
10409 | 0.960 | 0.914 | 0.833 | 0.781 | 0.976 | 0.951 | 0.886 | 0.835
10648 | 0.966 | 0.916 | 0.868 | 0.821 | 0.997 | 0.995 | 0.969 | 0.917
10786 | 0.931 | 0.840 | 0.749 | 0.700 | 0.938 | 0.858 | 0.772 | 0.722
11120 | 0.852 | 0.571 | 0.433 | 0.391 | 0.905 | 0.736 | 0.606 | 0.555
Supplementary Table 5. PCC and masked PCC for increasing fractions of particles with randomised angles (0%, 10%, 20%, 30%).

Entry | PCC (0%) | PCC (10%) | PCC (20%) | PCC (30%) | Masked PCC (0%) | Masked PCC (10%) | Masked PCC (20%) | Masked PCC (30%)
10166 | 0.981 | 0.971 | 0.956 | 0.935 | 0.995 | 0.992 | 0.985 | 0.975
10280 | 0.978 | 0.964 | 0.946 | 0.926 | 0.995 | 0.988 | 0.975 | 0.958
10374 | 0.980 | 0.971 | 0.959 | 0.945 | 0.998 | 0.995 | 0.987 | 0.976
10409 | 0.960 | 0.944 | 0.924 | 0.899 | 0.976 | 0.965 | 0.949 | 0.928
10648 | 0.966 | 0.941 | 0.909 | 0.875 | 0.997 | 0.988 | 0.970 | 0.949
10786 | 0.931 | 0.904 | 0.874 | 0.838 | 0.938 | 0.914 | 0.886 | 0.853
11120 | 0.852 | 0.805 | 0.753 | 0.699 | 0.905 | 0.872 | 0.834 | 0.792
[Supplementary Figures 4 and 5: FSC resolution and PCC as a function of the fraction of misaligned particles (left column) and of the amplitude of the uniform angular noise (right column)]
