This page was generated from docs/examples/2_mergingstats.ipynb.

HEWL S-SAD Merging Statistics

Merging statistics are a useful means to assess data quality in crystallography. However, each statistic has inherent shortcomings. For example, R-merge will appear inflated if the multiplicity is high, and the Pearson correlation coefficients used for \(CC_{1/2}\) are very sensitive to outliers.
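For context, the standard definitions make the multiplicity issue explicit: R-merge sums unweighted deviations from the mean intensity without accounting for the number of observations, \(n_h\), of each reflection, whereas the multiplicity-corrected R-meas includes a \(\sqrt{n_h/(n_h-1)}\) factor:

\begin{equation} R_{merge} = \frac{\sum_{h}\sum_{i} |I_{h,i} - \langle I_h \rangle|}{\sum_{h}\sum_{i} I_{h,i}} \qquad R_{meas} = \frac{\sum_{h}\sqrt{\frac{n_h}{n_h-1}}\sum_{i} |I_{h,i} - \langle I_h \rangle|}{\sum_{h}\sum_{i} I_{h,i}} \end{equation}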

Most scaling and merging programs output multiple merging statistics to work around these shortcomings. Still, it can be useful to customize certain parameters, such as the number of resolution bins, or to implement a new and better statistic altogether.

In this notebook, reciprocalspaceship is used to compute half-dataset correlation coefficients (\(CC_{1/2}\) and \(CC_{anom}\)) for a dataset collected from a tetragonal hen egg-white lysozyme (HEWL) crystal at 6550 eV. These data are unmerged, but were scaled in AIMLESS, and they contain sufficient sulfur anomalous signal to determine a solution by the SAD method. This notebook illustrates the use of rs to implement a merging routine and a custom analysis, and it could serve as a template for other exploratory crystallographic data analyses. As an example, we will compare half-dataset correlations computed using Pearson and Spearman correlation coefficients.

[1]:
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_context("notebook", font_scale=1.3)
import numpy as np
[2]:
import reciprocalspaceship as rs
[3]:
print(rs.__version__)
0.9.9

Load scaled, unmerged data

These data have been scaled in AIMLESS and include the image number (BATCH) along with the scaled I and SIGI values.

[4]:
hewl = rs.read_mtz("data/HEWL_unmerged.mtz")
[5]:
hewl.head()
[5]:
             BATCH          I        SIGI  PARTIAL
H K L
0 0 4          137   696.5212    87.83294    False
0 0 4          520   710.6812   88.107025    False
0 0 4          856  672.05634    87.75671    False
0 0 4         1239  642.47485    87.90302    False
0 0 4         2160  655.71783    87.74394    False
[6]:
print(f"Number of observed reflections: {len(hewl)}")
Number of observed reflections: 816804

Merging with Inverse-Variance Weights

Since the input data are unmerged, we will implement the inverse-variance weighting scheme used by AIMLESS to merge the observations. The weighted average is a better estimator of the true mean than the unweighted average, and this weighting scheme corresponds to the maximum likelihood estimator of the true mean if we assume that the observations are normally distributed about it.

The merged intensity for each reflection, \(I_h\), can be determined from the observed intensities, \(I_{h,i}\), and error estimates, \(\sigma_{h,i}\), as follows:

\begin{equation} I_h = \frac{\sum_{i}w_{h,i} I_{h,i}}{\sum_{i} w_{h,i}} \end{equation}

where the weight for each observation, \(w_{h,i}\), is given by:

\begin{equation} w_{h,i} = \frac{1}{\sigma_{h,i}^2} \end{equation}

The updated estimate of the uncertainty, \(\sigma_{h}\), is given by:

\begin{equation} \sigma_{h} = \sqrt{\frac{1}{\sum_{i} w_{h,i}}} \end{equation}
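These formulas follow from maximum likelihood: if each observation is normally distributed about the true mean, \(I_{h,i} \sim \mathcal{N}(I_h, \sigma_{h,i}^2)\), the negative log-likelihood is \(\frac{1}{2}\sum_i w_{h,i}(I_{h,i} - I_h)^2\) up to constants, and setting its derivative with respect to \(I_h\) to zero yields the weighted mean above. As a quick numeric sanity check of these equations (toy values, not taken from the HEWL data):

import numpy as np

# Three hypothetical observations of a single reflection
I_obs   = np.array([100.0, 110.0, 90.0])
sig_obs = np.array([10.0, 20.0, 10.0])

w = sig_obs**-2                           # w_i = 1 / sigma_i^2
I_merged   = (w * I_obs).sum() / w.sum()  # inverse-variance weighted mean
sig_merged = np.sqrt(1 / w.sum())         # propagated uncertainty

print(I_merged, sig_merged)  # ~96.67, ~6.67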

Let’s start by implementing the above equations in a function that computes the merged \(I_h\) and \(\sigma_h\). We will use the pandas groupby methods to apply this function across the unique Miller indices in the DataSet. If we group Friedel pairs together, we will refer to the merged quantity as IMEAN; if we keep the Friedel pairs separate, we will refer to the quantities as I(+) and I(-).

[7]:
def merge(dataset, anomalous=False):
    """
    Merge dataset using inverse-variance weights.

    Parameters
    ----------
    dataset : rs.DataSet
        DataSet to be merged containing scaled I and SIGI
    anomalous : bool
        If True, I(+) and I(-) will be reported. If False,
        IMEAN will be reported

    Returns
    -------
    rs.DataSet
        Merged DataSet object
    """
    # Map observations to the reciprocal asymmetric unit
    ds = dataset.hkl_to_asu(anomalous=anomalous)
    ds["w"]  = ds["SIGI"]**-2      # inverse-variance weights
    ds["wI"] = ds["I"] * ds["w"]
    g = ds.groupby(["H", "K", "L"])

    # Weighted mean and propagated uncertainty for each Miller index
    result = g[["w", "wI"]].sum()
    result["I"] = result["wI"] / result["w"]
    result["SIGI"] = np.sqrt(1 / result["w"])
    result = result.loc[:, ["I", "SIGI"]]
    result.merged = True

    if anomalous:
        result = result.unstack_anomalous()

    return result

Using anomalous=False, this function can be used to compute IMEAN and SIGIMEAN by including both Friedel pairs:

[8]:
result1 = merge(hewl, anomalous=False)
[9]:
result1.sample(5)
[9]:
                    I       SIGI
H  K  L
23 22  9    1349.8007  22.878157
37 19  0     260.0996   9.075029
17 11 15    146.02324   2.895325
35  1 14     9.629998   3.902024
15  5 18     7.837115   1.039951

Using anomalous=True, this function can be used to compute I(+), SIGI(+), I(-), and SIGI(-) by separating Friedel pairs:

[10]:
result2 = merge(hewl, anomalous=True)
[11]:
result2.sample(5)
[11]:
                  I(+)    SIGI(+)    SIGI(-)        I(-)
H  K  L
20  4  7     790.00696   13.43943   13.45224   819.43054
21 15 16     51.543583   2.840802   2.675156   50.629883
35  8 12      56.64869   2.762924   2.779431    52.87201
16 16 15      517.4729  11.513449  11.513449    517.4729
33 23  3    101.954094   3.753756   3.605523   89.131714

A variant of the above function is implemented in rs.algorithms, and we will use that implementation in the next section to compute merging statistics. It reports IMEAN, I(+), and I(-), the associated uncertainties, and the number of observations contributing to each Friedel half (N(+), N(-)) for every unique Miller index.

[12]:
result3 = rs.algorithms.merge(hewl)
[13]:
result3.sample(5)
[13]:
               IMEAN    SIGIMEAN        I(+)    SIGI(+)       I(-)    SIGI(-)  N(+)  N(-)
H  K  L
38 10  4    454.0382   11.617793    416.8525  21.370535   469.6386  13.841894     8    20
17  9 15  101.308815   2.1820927   100.45327  3.0412343  102.21658  3.1326878    32    31
38  7  8    4.571434   0.9255639   4.8481956  1.2750458  4.2631493  1.3457003    20    20
11  5  5    3991.629   47.690647   4061.3477  66.316765  3916.9563   68.63234    60    56
27 18 11     14.8991   1.0327507   10.999734  1.4392322  19.038132  1.4828024    20    20

Merging with 2-fold Cross-Validation

To compute correlation coefficients, we will repeatedly split our data into two half-datasets by randomly partitioning the image numbers; splitting by image keeps all observations recorded on the same image in the same half. The half-datasets will be merged independently, and the spread across repeated splits will be used to estimate uncertainties in the correlation coefficients. We will first write a method to randomly split our data, and then a method that automates the sampling and merging of multiple half-datasets.

[14]:
def sample_halfdatasets(data):
    """Randomly split DataSet into two equal halves by BATCH"""
    batch = data.BATCH.unique().to_numpy(dtype=int)
    np.random.shuffle(batch)
    halfbatch1, halfbatch2 = np.array_split(batch, 2)
    half1 = data.loc[data.BATCH.isin(halfbatch1)]
    half2 = data.loc[data.BATCH.isin(halfbatch2)]
    return half1, half2
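As a quick sanity check of this function (not one of the original notebook cells), the two halves should use disjoint sets of images and together account for every observation:

# Hypothetical check: the halves partition the images and the observations
half1, half2 = sample_halfdatasets(hewl)
assert set(half1.BATCH.unique()).isdisjoint(half2.BATCH.unique())
assert len(half1) + len(half2) == len(hewl)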
[15]:
def merge_dataset(dataset, nsamples):
    """
    Merge DataSet using inverse-variance weighting scheme. This represents the
    maximum-likelihood estimator of the mean of the observed intensities assuming
    they are independent and normally distributed with the same mean.

    Sample means across half-datasets can be used to compute the merging statistics CC1/2 and CCanom.
    """
    dataset = dataset.copy()
    samples = []
    for n in range(nsamples):
        half1, half2 = sample_halfdatasets(dataset)
        mergedhalf1 = rs.algorithms.merge(half1)
        mergedhalf2 = rs.algorithms.merge(half2)
        # Join the two half-dataset results on Miller index
        result = mergedhalf1.merge(mergedhalf2, left_index=True, right_index=True,
                                   suffixes=("1", "2"))
        result["sample"] = n
        samples.append(result)
    return rs.concat(samples).sort_index()

Merge HEWL data

We will now merge the HEWL data, repeatedly sampling across half-datasets in order to assess the distribution of correlation coefficients.

[16]:
# This cell takes a few minutes with nsamples=15
merged = merge_dataset(hewl, 15)
[17]:
merged
[17]:
IMEAN1 SIGIMEAN1 I(+)1 SIGI(+)1 I(-)1 SIGI(-)1 N(+)1 N(-)1 IMEAN2 SIGIMEAN2 I(+)2 SIGI(+)2 I(-)2 SIGI(-)2 N(+)2 N(-)2 sample
H K L
0 0 4 657.99817 43.882294 657.99817 43.882294 657.99817 43.882294 4 4 662.40204 25.35386 662.40204 25.353859 662.40204 25.353859 12 12 0
4 645.79663 43.884262 645.79663 43.884266 645.79663 43.884266 4 4 666.4745 25.35348 666.4745 25.35348 666.4745 25.35348 12 12 1
4 662.0198 25.348951 662.0198 25.348951 662.0198 25.348951 12 12 659.13995 43.907776 659.13995 43.907776 659.13995 43.907776 4 4 2
4 660.4694 33.18889 660.4694 33.18889 660.4694 33.18889 7 7 661.946 29.271538 661.946 29.271538 661.946 29.271538 9 9 3
4 666.3023 24.359608 666.3023 24.35961 666.3023 24.35961 13 13 639.66833 50.654987 639.66833 50.654987 639.66833 50.654987 3 3 4
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
45 10 2 30.170218 7.3511124 30.170218 7.3511124 NaN NaN 1 0 15.318142 3.9687188 15.318142 3.9687188 NaN NaN 3 0 9
2 7.669444 6.6297607 7.669444 6.6297607 NaN NaN 1 0 22.89468 4.1084785 22.89468 4.1084785 NaN NaN 3 0 10
2 11.665072 5.2794247 11.665072 5.2794247 NaN NaN 2 0 24.119883 4.656634 24.119883 4.656634 NaN NaN 2 0 11
2 7.669444 6.6297607 7.669444 6.6297607 NaN NaN 1 0 22.89468 4.1084785 22.89468 4.1084785 NaN NaN 3 0 12
2 17.961912 4.288131 17.961912 4.288131 NaN NaN 3 0 20.064919 6.0180674 20.064919 6.0180674 NaN NaN 1 0 13

187364 rows × 17 columns


Compute \(CC_{1/2}\) and \(CC_{anom}\)

We will first assign each reflection to a resolution bin, and then compute the correlation coefficients within each bin.

[18]:
merged, labels = merged.assign_resolution_bins(bins=15)
[19]:
# Correlate the two half-dataset estimates of IMEAN within each (sample, bin) group
groupby1 = merged.groupby(["sample", "bin"])[["IMEAN1", "IMEAN2"]]
pearson1 = groupby1.corr(method="pearson").unstack().loc[:, ("IMEAN1", "IMEAN2")]
pearson1.name = "Pearson"
spearman1 = groupby1.corr(method="spearman").unstack().loc[:, ("IMEAN1", "IMEAN2")]
spearman1.name = "Spearman"
results1 = rs.concat([pearson1, spearman1], axis=1)
results1 = results1.groupby("bin").agg(["mean", "std"])
[20]:
results1.head()
[20]:
        Pearson              Spearman
           mean       std        mean       std
bin
0      0.997334  0.000316    0.998895  0.000063
1      0.997958  0.000360    0.999363  0.000066
2      0.998971  0.000306    0.999678  0.000023
3      0.999297  0.000102    0.999658  0.000024
4      0.999368  0.000147    0.999687  0.000022
[21]:
plt.figure(figsize=(8, 4))
plt.errorbar(results1.index, results1[("Pearson", "mean")],
             yerr=results1[("Pearson", "std")],
             color="#1b9e77",
             label=r"$CC_{1/2}$ (Pearson)")
plt.errorbar(results1.index, results1[("Spearman", "mean")],
             yerr=results1[("Spearman", "std")],
             color="#d95f02",
             label=r"$CC_{1/2}$ (Spearman)")
plt.xticks(results1.index, labels, rotation=45, ha='right', rotation_mode='anchor')
plt.ylabel("Correlation Coefficient")
plt.xlabel(r"Resolution Bin ($\AA$)")
plt.legend(loc='center left', bbox_to_anchor=(1, 0.5))
plt.grid(axis="y", linestyle='--')
plt.show()
[Plot: half-dataset \(CC_{1/2}\) by resolution bin, Pearson vs. Spearman]

It is important to note the scale on the y-axis: this dataset is edge-limited, and as such the \(CC_{1/2}\) is very high across all resolution bins. The Spearman CC is higher in all but the highest resolution bins, and it shows a smaller standard deviation across samples. This is consistent with our expectation that Spearman CCs are a more robust estimator of correlation than Pearson CCs.
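This robustness is easy to demonstrate in isolation. Below is a minimal sketch on synthetic data (unrelated to the HEWL measurements) using the same pandas corr methods employed above; the variable names and values are purely illustrative:

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
x = rng.normal(size=100)
df = pd.DataFrame({"x": x, "y": x + 0.1 * rng.normal(size=100)})
df.iloc[0] = [10.0, -10.0]  # inject a single gross outlier

print(df.corr(method="pearson").loc["x", "y"])   # badly degraded by the outlier
print(df.corr(method="spearman").loc["x", "y"])  # rank-based, barely affected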

Let’s now repeat this for the anomalous data, computing \(CC_{anom}\):

[22]:
merged["ANOM1"] = merged["I(+)1"] - merged["I(-)1"]
merged["ANOM2"] = merged["I(+)2"] - merged["I(-)2"]

# Similar to CChalf, but we will only look at acentric reflections
groupby2 = merged.acentrics.groupby(["sample", "bin"])[["ANOM1", "ANOM2"]]
pearson2 = groupby2.corr(method="pearson").unstack().loc[:, ("ANOM1", "ANOM2")]
pearson2.name = "Pearson"
spearman2 = groupby2.corr(method="spearman").unstack().loc[:, ("ANOM1", "ANOM2")]
spearman2.name = "Spearman"
results2 = rs.concat([pearson2, spearman2], axis=1)
results2 = results2.groupby("bin").agg(["mean", "std"])
[23]:
results2.head()
[23]:
        Pearson              Spearman
           mean       std        mean       std
bin
0      0.422196  0.037998    0.497923  0.029704
1      0.233349  0.061975    0.328263  0.023917
2      0.357182  0.051746    0.444657  0.029250
3      0.484710  0.045454    0.543314  0.031731
4      0.579058  0.023952    0.611739  0.020141
[24]:
plt.figure(figsize=(8, 4))
plt.errorbar(results2.index, results2[("Pearson", "mean")],
             yerr=results2[("Pearson", "std")],
             color="#1b9e77",
             label=r"$CC_{anom}$ (Pearson)")
plt.errorbar(results2.index, results2[("Spearman", "mean")],
             yerr=results2[("Spearman", "std")],
             color="#d95f02",
             label=r"$CC_{anom}$ (Spearman)")
plt.xticks(results2.index, labels, rotation=45, ha='right', rotation_mode='anchor')
plt.ylabel("Correlation Coefficient")
plt.xlabel(r"Resolution Bin ($\AA$)")
plt.legend(loc='center left', bbox_to_anchor=(1, 0.5))
plt.grid(axis="y", linestyle='--')
plt.show()
[Plot: \(CC_{anom}\) by resolution bin, Pearson vs. Spearman]

There is significant anomalous signal across all but the highest resolution bins. The Spearman CCs have smaller error bars than the corresponding Pearson CCs and the Spearman CCs are also higher in most bins. These differences highlight the influence of outlier measurements on the different correlation coefficients.


Summary

We have used reciprocalspaceship to merge a dataset using inverse-variance weights. As part of this analysis, we performed repeated 2-fold cross-validation to compute Pearson and Spearman correlation coefficients and their associated uncertainties. This relatively simple procedure for obtaining uncertainty estimates is seldom used when assessing merging quality, yet the standard deviation of a computed correlation coefficient can approach \(\pm0.1\) for quantities such as \(CC_{anom}\). This is worth keeping in mind when analyzing SAD experiments, particularly since this dataset is of very high quality (edge-limited) as judged by \(CC_{1/2}\). Putting the two correlation coefficients on the same axes emphasizes this point:

[25]:
plt.figure(figsize=(9, 6))
plt.errorbar(results1.index, results1[("Pearson", "mean")],
             yerr=results1[("Pearson", "std")],
             color='#1b9e77',
             label=r"$CC_{1/2}$ (Pearson)")
plt.errorbar(results1.index, results1[("Spearman", "mean")],
             yerr=results1[("Spearman", "std")],
             color='#d95f02',
             label=r"$CC_{1/2}$ (Spearman)")
plt.errorbar(results2.index, results2[("Pearson", "mean")],
             yerr=results2[("Pearson", "std")],
             color='#1b9e77',
             linestyle="--",
             label=r"$CC_{anom}$ (Pearson)")
plt.errorbar(results2.index, results2[("Spearman", "mean")],
             yerr=results2[("Spearman", "std")],
             color='#d95f02',
             linestyle="--",
             label=r"$CC_{anom}$ (Spearman)")
plt.xticks(results1.index, labels, rotation=45, ha='right', rotation_mode='anchor')
plt.ylabel(r"Correlation Coefficient")
plt.xlabel(r"Resolution Bin ($\AA$)")
plt.legend(loc='center left', bbox_to_anchor=(1, 0.5))
plt.grid(axis="y", linestyle='--')
plt.tight_layout()
[Plot: \(CC_{1/2}\) and \(CC_{anom}\) by resolution bin, Pearson and Spearman]

Even a simple change to the procedure for computing merging statistics, such as substituting a Spearman correlation coefficient for a Pearson one, can alter the apparent quality of a dataset. By lowering the barrier to implementing new analyses, we hope that reciprocalspaceship can encourage the development of more robust indicators of crystallographic data quality.