This page was generated from docs/examples/3_mergingerrormodel.ipynb. An interactive version is available online via Binder.

Merging with a Robust Error Model

In the previous example, we computed the common merging statistics \(CC_{1/2}\) and \(CC_{anom}\) to explore a dataset with significant anomalous signal from native sulfur atoms. One assumption that we made while merging is that the scaled reflection observations are normally distributed about the mean. This assumption is consistent with the merging strategy used by AIMLESS, which was used to scale the data in the first place. In this example, we will explore whether the scaled reflection observations are normally distributed, and whether we can improve the anomalous signal (\(CC_{anom}\)) by using a different error model.

Note: See pytorch.org for customizable PyTorch installation instructions.

[1]:
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
sns.set_context("notebook", font_scale=1.3)
from tqdm.notebook import tqdm
[2]:
# Install PyTorch to running kernel
import sys
!{sys.executable} -m pip install torch
Requirement already satisfied: torch in /Users/jgreisman/miniconda3/envs/test/lib/python3.9/site-packages (1.10.1)
Requirement already satisfied: typing-extensions in /Users/jgreisman/miniconda3/envs/test/lib/python3.9/site-packages (from torch) (4.0.1)
[3]:
import torch
[4]:
import reciprocalspaceship as rs
[5]:
print(rs.__version__)
0.9.18

Normal Error Model

It is common in merging to assume that scaled intensities are normally distributed about the true mean. We can assess the validity of this assumption by looking at the residuals between the scaled intensities and the maximum likelihood estimate of their true intensity.
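Concretely, if the normal error model holds, the normalized residuals (the same quantity plotted on the x-axis of the histogram below) should be approximately standard normal:

\begin{align*}
\frac{I_{h,i} - I^{ML}_{h}}{\sigma_{I_{h,i}}} \sim \mathcal{N}(0, 1)
\end{align*}

The red reference curve in the histogram below is this standard normal density scaled by the number of observations and the histogram bin width.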

[6]:
ds = rs.read_mtz("data/HEWL_unmerged.mtz")
[7]:
merged_normal = rs.algorithms.merge(ds)
merged_normal = merged_normal.stack_anomalous()
[8]:
ds.hkl_to_asu(anomalous=True, inplace=True)
ds["IML"] = merged_normal.loc[ds.index, "I"]
[9]:
# Expected residuals for normally distributed data
x = np.linspace(-6., 6., 50)
bin_width = x[1] - x[0]
normal = bin_width*ds.shape[0]*(1/np.sqrt(2*np.pi))*np.exp(-0.5*x**2)

# Histogram Residuals
fig = plt.figure(figsize=(8, 5))
sns.histplot((ds.I - ds.IML)/ds.SIGI, bins=x, color="k", label="Residuals")
plt.plot(x, normal, 'r-', lw=2, label="Expected")
plt.ylabel("Log Count")
plt.yscale("log")
plt.ylim(1e0, 3e5)
plt.xlabel(r"$\frac{I_{h,i} - I^{ML}_{h}}{\sigma_{I_{h,i}}}$", fontsize=24)
plt.legend(loc='center left', bbox_to_anchor=(1, 0.5))
plt.tight_layout()
plt.show()
../_images/examples_3_mergingerrormodel_10_0.png

These residuals are not symmetric about \(0\), as would be expected for truly normally distributed data. In addition, the tails are much “heavier” than those of a normal distribution. This suggests that it may be possible to do a better job merging these scaled observations by using a different error model that is more robust to outliers.
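One way to quantify this departure (a small supplementary check with SciPy, not part of the original notebook) is to compute the sample skewness and excess kurtosis of the normalized residuals; both should be close to zero for normally distributed data, while heavy tails show up as a large positive excess kurtosis:

from scipy.stats import kurtosis, skew

# Normalized residuals, as plotted above
resid = ((ds.I - ds.IML) / ds.SIGI).to_numpy()

print("skewness:", skew(resid))             # ~0 for a normal distribution
print("excess kurtosis:", kurtosis(resid))  # ~0 for a normal distribution; >0 indicates heavy tails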


Generalized Merging with Flexible Error Model

The inverse-variance weighting scheme implemented in rs.algorithms.merge() is the maximum likelihood estimator for the true mean if we assume the observations are normally distributed. However, we can write a more general form for the maximum likelihood estimator for the mean of each intensity distribution, \(\mu \in\mathbb{R}^{|\mathbf{H}|}\), without assuming a specific distribution for the error model. Therefore, we will maximize the probability of the data given the model

\begin{align*}
P(data | model) &= \prod_{h,i}P(I_{h,i}|\mu_h, \sigma_{I_{h,i}})
\end{align*}

by minimizing the negative log likelihood

\begin{align*}
\mathcal{L} &\triangleq -\log P(data | model) \\
&= -\sum_{h,i}\log P(I_{h,i}|\mu_h, \sigma_{I_{h,i}}) \\
I^{ML} &= \underset{\mu}{\mathrm{argmin}}\ -\sum_{h,i} \log P(I_{h,i} | \mu_h, \sigma_{I_{h,i}})
\end{align*}

with respect to \(\mu\). With this formulation, it is possible to supply any parametric form for the error model belonging to the location-scale family of distributions. This maximum likelihood estimator is implemented in the function below.
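For reference, when the error model is normal this estimator reduces to the familiar inverse-variance weighted mean (a standard result stated here for context; the internals of rs.algorithms.merge() are not reproduced here):

\begin{align*}
I^{ML}_{h} = \frac{\sum_{i} I_{h,i} / \sigma^{2}_{I_{h,i}}}{\sum_{i} 1 / \sigma^{2}_{I_{h,i}}}, \qquad \sigma_{I^{ML}_{h}} = \left( \sum_{i} 1 / \sigma^{2}_{I_{h,i}} \right)^{-1/2}
\end{align*}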

[10]:
def merge_mle(ds, distribution, *args, lr=.5, progress_bar=True, return_loss=False, **kwargs):
    """
    Merge observations using the provided distribution as an error model.
    Additional arguments or keyword arguments will be passed to the PyTorch
    distribution constructor.

    Parameters
    ----------
    ds : rs.DataSet
        Scaled, unmerged observations
    distribution : torch.distributions.Distribution
        PyTorch distribution to use as error model
    lr : float
        Learning rate for Adam optimizer
    progress_bar : bool
        Whether to display a progress bar for optimization
    return_loss : bool
        Whether to return the loss function values during optimization

    Returns
    -------
    rs.DataSet or (rs.DataSet, losses)
        Merged DataSet with or without list of loss function values
    """
    # Compute MLE with normal error model
    ds = ds.copy()
    mle_norm = rs.algorithms.merge(ds).stack_anomalous()
    ds["IML"] = mle_norm.loc[ds.index, "I"]

    # Observed intensities and error estimates
    groupby = ds.groupby(['H', 'K', 'L'])
    idx = groupby.ngroup().to_numpy()
    I = torch.as_tensor(ds.I.to_numpy())
    SigI = torch.as_tensor(ds.SIGI.to_numpy())

    # Initialize optimization at MLE with normal error model
    mle = groupby.first()["IML"].to_numpy()
    mean = torch.tensor(mle, requires_grad=True)

    # Define loss function
    def _evaluate_loss():
        return -torch.sum(distribution(*args, loc=mean[idx], scale=SigI, **kwargs).log_prob(I))

    # Setup and fit model
    losses = []
    opt = torch.optim.Adam([mean], lr=lr)
    for _ in tqdm(range(300), disable=not progress_bar):
        opt.zero_grad()
        loss = _evaluate_loss()
        losses.append(loss.detach())
        loss.backward()
        opt.step()
    # Estimate uncertainties from the curvature of the negative log-likelihood
    # at the optimum (elementwise second derivatives with respect to `mean`)
    grad = torch.autograd.grad(_evaluate_loss(), mean, create_graph=True)[0]
    hess = torch.autograd.grad(grad.sum(), mean, create_graph=True)[0]

    # Package results
    results = rs.DataSet({'I': mean.detach().numpy(),
                          'SIGI': np.sqrt(1./hess.detach().numpy())},
                         index=groupby.first().index,
                         spacegroup=ds.spacegroup,
                         cell=ds.cell,
                         merged=True)
    results.infer_mtz_dtypes(inplace=True)

    if return_loss:
        return results.unstack_anomalous(), losses
    return results.unstack_anomalous()

This function must be passed a torch.distributions.Distribution to use as the error model. It then uses the Adam optimizer to minimize the negative log-likelihood and fit the merged intensities, \(I^{ML}\). For stability, the model is initialized using the mean intensity values from rs.algorithms.merge(). The following cell fits the model using a Student’s t-distributed error model with df=4.0.

[11]:
result1, loss1 = merge_mle(ds, torch.distributions.StudentT, 4.0,
                           progress_bar=True, return_loss=True)

# Plot loss function
plt.plot(loss1)
plt.xlabel("Iteration")
plt.ylabel("Loss")
plt.show()
../_images/examples_3_mergingerrormodel_15_1.png
[12]:
result1.sample(5)
[12]:
               I(+)     SIGI(+)        I(-)    SIGI(-)
H  K  L
18 14 9   13.383095  0.59039783   12.244447  0.5472484
25 24 1   481.62515   16.697405   487.13596  16.214811
39 13 7    62.35997   3.6021707    61.64068  3.1317017
29 29 5   235.86664    8.579554   235.86664  8.579554
30 14 5    174.9836    5.340581    189.3732  4.8809543
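Because merge_mle() simply forwards the distribution and any extra arguments to the PyTorch constructor, other location-scale distributions that accept loc and scale can be tried with the same call signature. For example, a Laplace error model, which is also robust to outliers, could be used as follows (a hypothetical sketch, not run as part of this example):

result_laplace, loss_laplace = merge_mle(ds, torch.distributions.Laplace,
                                         progress_bar=True, return_loss=True)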

As a sanity check, we can test this implementation by passing the normal distribution as the error model. In this case, the initial values should match the maximum likelihood estimates, and the optimization should not change them.

[13]:
result2, loss2 = merge_mle(ds, torch.distributions.Normal, lr=1e-3,
                           progress_bar=True, return_loss=True)
result2 = result2[["I(+)", "SIGI(+)", "I(-)", "SIGI(-)"]].stack_anomalous()

# Plot loss function
fig, ax = plt.subplots(ncols=2, figsize=(12, 5.5))
ax[0].plot(loss2)
ax[0].set_ylim(4.15e6, max(loss1))
ax[0].set_xlabel("Iteration")
ax[0].set_ylabel("Loss")
ax[0].set_title("Loss Function")
ax[1].loglog(merged_normal["I"].to_numpy(), result2.loc[merged_normal.index, "I"].to_numpy(), "k.", alpha=0.5)
ax[1].set_xlabel(r"$I^{ML}$ (AIMLESS)")
ax[1].set_ylabel(r"$I^{ML}$ (PyTorch)")
plt.tight_layout()
plt.show()
../_images/examples_3_mergingerrormodel_18_1.png

We can see here that the loss function does not change during the optimization, and that the final maximum likelihood estimates for the intensities are equivalent to the input. This validates that the merging function is working as expected.


Assess Student’s t-Distributed Error Model

A Student’s t-distribution is useful for modeling data that contain outliers. This distribution places more density in the tails than a normal distribution, which makes maximum likelihood estimates robust to outlying measurements. It is parameterized by the degrees of freedom, \(\nu\), which can take any value in \((0, \infty)\) and controls the heaviness of the tails. As \(\nu\to\infty\), the probability density function approaches that of a normal distribution.
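To illustrate the difference (a quick sketch using PyTorch’s built-in distributions, not part of the original analysis), the log-probability of a Student’s t-distribution falls off much more slowly than that of a normal distribution, so outlying observations contribute far less to the negative log-likelihood:

import torch

x = torch.tensor([0.0, 2.0, 4.0, 8.0])
normal = torch.distributions.Normal(loc=0.0, scale=1.0)
student = torch.distributions.StudentT(df=4.0, loc=0.0, scale=1.0)

# The normal log-probability decreases quadratically with the residual,
# while the Student's t decreases only logarithmically for large residuals.
print(normal.log_prob(x))
print(student.log_prob(x))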

Below, we will set up a few helper functions for merging our data in randomly partitioned half-datasets. We will do this for Student’s t-distributions spanning several degrees of freedom, as well as for the normal distribution, so that we can compare the results. Based on the previous example, we will compute \(CC_{anom}\) using a Spearman correlation coefficient and repeated 2-fold cross-validation to compare the different error models.
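Concretely, for each random half-dataset partition we form per-reflection anomalous differences and correlate them between the two halves (this is what computeCC(), defined further below, computes; the notation here is introduced only for clarity):

\begin{align*}
\Delta I_{anom, h} &= I_h(+) - I_h(-) \\
CC_{anom} &= \mathrm{corr}\left(\Delta I^{(1)}_{anom}, \Delta I^{(2)}_{anom}\right)
\end{align*}

where the superscripts denote the two half-datasets and the correlation is computed with the Spearman rank coefficient.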

[14]:
def sample_halfdatasets(data):
    """Randomly split DataSet into two equal halves by BATCH"""
    batch = data.BATCH.unique().to_numpy(dtype=int)
    np.random.shuffle(batch)
    halfbatch1, halfbatch2 = np.array_split(batch, 2)
    half1 = data.loc[data.BATCH.isin(halfbatch1)]
    half2 = data.loc[data.BATCH.isin(halfbatch2)]
    return half1, half2
[15]:
def merge_dataset(dataset, nsamples, distribution, *args, **kwargs):
    """
    Merge dataset with repeated 2-fold cross-validation using `distribution`
    as error model.
    """
    dataset = dataset.copy()
    samples = []
    for n in tqdm(range(nsamples)):
        half1, half2 = sample_halfdatasets(dataset)
        mergedhalf1 = merge_mle(half1, distribution, *args, progress_bar=False, **kwargs)
        mergedhalf2 = merge_mle(half2, distribution, *args, progress_bar=False, **kwargs)
        result = mergedhalf1.merge(mergedhalf2, on=["H", "K", "L"], suffixes=(1, 2))
        result["sample"] = n
        samples.append(result)
    return rs.concat(samples).sort_index()
[16]:
def merge_dataset_normal(dataset, nsamples):
    """
    Merge dataset with repeated 2-fold cross-validation using normal distribution
    as error model.
    """
    dataset = dataset.copy()
    samples = []
    for n in tqdm(range(nsamples)):
        half1, half2 = sample_halfdatasets(dataset)
        mergedhalf1 = rs.algorithms.merge(half1)
        mergedhalf2 = rs.algorithms.merge(half2)
        result = mergedhalf1.merge(mergedhalf2, on=["H", "K", "L"], suffixes=(1, 2))
        result["sample"] = n
        samples.append(result)
    return rs.concat(samples).sort_index()

Note: Using these settings, the following will take ~15 min to complete. For the pre-print, this was run with nsamples=15. It is possible to run into occasional numerical instabilities when computing the Hessian for df=4.0, but these only impact the \(\sigma_I^{ML}\) estimates, which are not used here.

[17]:
merged_df4 = merge_dataset(ds, 5, torch.distributions.StudentT, 4.0)
merged_df8 = merge_dataset(ds, 5, torch.distributions.StudentT, 8.0)
merged_df16 = merge_dataset(ds, 5, torch.distributions.StudentT, 16.0)
merged_df32 = merge_dataset(ds, 5, torch.distributions.StudentT, 32.0)
merged_df64 = merge_dataset(ds, 5, torch.distributions.StudentT, 64.0)
[18]:
merged_dfinf = merge_dataset_normal(ds, 5)

Now that we have merged the data using Student’s t-distributions and a normal distribution as the error model, we can compute \(CC_{anom}\) and plot the results by resolution bin.

[19]:
def computeCC(merged, bins=12):
    """Compute CCanom for input data"""
    merged, labels = merged.assign_resolution_bins(bins=bins)
    merged["ANOM1"] = merged["I(+)1"] - merged["I(-)1"]
    merged["ANOM2"] = merged["I(+)2"] - merged["I(-)2"]
    groupby = merged.acentrics.groupby(["sample", "bin"])[["ANOM1", "ANOM2"]]
    spearman = groupby.corr(method="spearman").unstack().loc[:, ("ANOM1", "ANOM2")]
    results = spearman.groupby("bin").agg(["mean", "std"])
    return results, labels
[20]:
resultsinf, labels = computeCC(merged_dfinf)
results4, labels = computeCC(merged_df4)
results8, labels = computeCC(merged_df8)
results16, labels = computeCC(merged_df16)
results32, labels = computeCC(merged_df32)
results64, labels = computeCC(merged_df64)
[21]:
def plot(results, label, ax, color=None):
    ax.errorbar(results.index, results["mean"], yerr=results["std"], color=color, label=label)
    return

fig = plt.figure(figsize=(10, 6))
ax = fig.gca()
with sns.color_palette("viridis", 5) as palette:
    plot(results4, r"Student-T ($d.f.=4$)", ax, color=palette[4])
    plot(results8, r"Student-T ($d.f.=8$)", ax, color=palette[3])
    plot(results16, r"Student-T ($d.f.=16$)", ax, color=palette[2])
    plot(results32, r"Student-T ($d.f.=32$)", ax, color=palette[1])
    plot(results64, r"Student-T ($d.f.=64$)", ax, color=palette[0])
plot(resultsinf,"Normal", ax, color="k")
plt.ylabel(r"$CC_{anom}$ (Spearman)")
plt.xlabel(r"Resolution Bin ($\AA$)")
plt.legend(loc='center left', bbox_to_anchor=(1, 0.5), title="Error Model")
plt.ylim(0, 0.72)

plt.xticks(resultsinf.index, labels, rotation=45, ha='right', rotation_mode='anchor')
plt.grid(axis="y", linestyle='--')
plt.tight_layout()
plt.show()
../_images/examples_3_mergingerrormodel_30_0.png

Summary

This dataset was scaled and merged in AIMLESS, which involves rounds of outlier rejection and implicitly assumes that the intensities are normally distributed about the true mean. In the first section, we observed that the residuals from merging are not normally distributed, suggesting that the normal error model may have been suboptimal. We implemented a more general maximum likelihood-based approach for merging data using different probability distributions and used it to evaluate the performance of Student’s t-distributed error models. The Student’s t-distribution seemed like a reasonable starting point because it is often more robust to outliers than a normal distribution.

Using repeated 2-fold cross-validation, we saw that a Student’s t-distribution with few degrees of freedom (\(\nu=4\)) outperforms the normally distributed error model when assessed using \(CC_{anom}\). Furthermore, the performance seems to approach that of the normal distribution as the degrees of freedom are increased, which is expected since the t-distribution approaches a normal distribution as \(\nu\to\infty\).

Although this dataset is of quite high quality and was used to phase and refine a model from the native sulfur SAD signal, this example still shows that there can be an incremental improvement in \(CC_{anom}\) from revisiting assumptions about error models during merging. This model is implemented in ~40 lines of code using PyTorch, and can be quickly applied to any dataset of interest. By lowering the barrier to implementing such models, reciprocalspaceship makes it easy to try new analyses and to revisit some of the assumptions made in crystallographic data reduction.