DataSet

class reciprocalspaceship.DataSet(data=None, index=None, columns=None, dtype=None, copy=False, spacegroup=None, cell=None, merged=None)[source]

Bases: DataFrame

Representation of a crystallographic dataset.

A DataSet object provides a tabular representation of reflection data. Reflections are conventionally indexed by Miller indices (rows), but can also be indexed by additional metadata. Per-reflection data can be stored as columns. For additional information about inherited methods and attributes, please see the pandas.DataFrame documentation.

Attributes

acentrics

Access acentric reflections in DataSet

cell

Unit cell parameters (a, b, c, alpha, beta, gamma)

centrics

Access centric reflections in DataSet

merged

Whether DataSet contains merged reflection data (boolean)

reindexing_ops

Possible reindexing operations (merohedral twin laws) for DataSet

spacegroup

Crystallographic space group

Methods

__init__([data, index, columns, dtype, ...])

apply_symop(symop[, inplace])

Apply symmetry operation to all reflections in DataSet object.

assign_resolution_bins([bins, inplace, ...])

Assign reflections in DataSet to resolution bins.

canonicalize_phases([inplace])

Canonicalize columns with phase data to fall in the interval between -180 and 180 degrees.

compute_dHKL([inplace])

Compute the real space lattice plane spacing, d, associated with the HKL indices in the object.

compute_multiplicity([inplace, ...])

Compute the multiplicity of reflections in DataSet.

expand_anomalous()

Expands data by applying Friedel operator (-x, -y, -z).

expand_to_p1()

Generates all symmetrically equivalent reflections.

find_twin_laws([max_obliq, all_ops])

Find merohedral and pseudo-merohedral twin laws for cell and spacegroup of DataSet given an obliquity threshold (degrees).

from_gemmi(gemmiMtz)

Creates DataSet object from gemmi.Mtz object.

from_structurefactor(sf_key)

Convert complex structure factors to structure factor amplitudes and phases

get_complex_keys()

Return column labels for data with complex dtype.

get_hkls()

Get the Miller indices in the DataSet as a ndarray.

get_m_isym_keys()

Return column labels for data with M/ISYM dtype.

get_phase_keys()

Return column labels for data with Phase dtype.

get_reciprocal_grid_size([dmin, sample_rate])

Determine an appropriate 3D grid size for reflection data.

hkl_to_asu([inplace, anomalous])

Map HKL indices to the reciprocal space asymmetric unit.

hkl_to_observed([m_isym, inplace])

Map HKL indices to their observed index using an M/ISYM column.

infer_mtz_dtypes([inplace, index])

Infers MTZ dtypes from column names and underlying data.

is_isomorphous(other[, cell_threshold])

Determine whether DataSet is isomorphous to another DataSet.

join(*args[, check_isomorphous])

Join DataSets or named DataSeries using a database-style join on columns or indices.

label_absences([inplace])

Label systematically absent reflections in DataSet.

label_centrics([inplace])

Label centric reflections in DataSet.

merge(*args[, check_isomorphous])

Merge DataSet or named DataSeries using a database-style join on columns or indices.

remove_absences([inplace])

Remove systematically absent reflections in DataSet.

reset_index([level, drop, inplace, ...])

Reset the index or a specific level of a MultiIndex.

select_mtzdtype(dtype)

Return subset of DataSet’s columns that are of the given dtype.

set_index(keys[, drop, append, inplace, ...])

Set the DataSet index using existing columns.

stack_anomalous([plus_labels, minus_labels, ...])

Convert data from two-column anomalous format to one-column format.

to_gemmi([skip_problem_mtztypes, ...])

Creates gemmi.Mtz object from DataSet object.

to_numpy([dtype, copy, na_value])

Convert the DataSet to a NumPy array.

to_pickle(path, *args, **kwargs)

Pickle object to file.

to_reciprocal_grid(key[, sample_rate, dmin, ...])

Set up reciprocal grid with values from column, key, indexed by Miller indices.

to_reciprocalgrid(key[, sample_rate, dmin, ...])

Deprecated: Set up reciprocal grid with values from column, key, indexed by Miller indices.

to_structurefactor(sf_key, phase_key)

Convert structure factor amplitudes and phases to complex structure factors

unstack_anomalous([columns, suffixes])

Convert data from one-column format to two-column anomalous format.

write_mtz(mtzfile[, skip_problem_mtztypes, ...])

Write DataSet to MTZ file.

property acentrics

Access acentric reflections in DataSet

apply_symop(symop, inplace=False)[source]

Apply symmetry operation to all reflections in DataSet object.

Parameters:
  • symop (str, gemmi.Op) – Gemmi symmetry operation or string representing symmetry op

  • inplace (bool) – Whether to make the change in place or return a new DataSet
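
As a rough illustration of what applying a symmetry operation to a Miller index involves, the sketch below multiplies the row vector (h, k, l) by the rotation part of an operator. It is plain Python and does not call reciprocalspaceship or gemmi. The helper name apply_rotation_to_hkl and the hand-written matrix for the operator "-y,x-y,z" are assumptions made for this example, and the phase shift that comes from the translational part of an operator is not handled.

```python
def apply_rotation_to_hkl(hkl, rot):
    """Multiply the row vector hkl by a 3x3 rotation matrix (hypothetical helper)."""
    return tuple(sum(hkl[i] * rot[i][j] for i in range(3)) for j in range(3))


# Rotation part of the real-space operator "-y,x-y,z" (threefold axis along c),
# written out by hand for this example.
THREEFOLD = [
    [0, -1, 0],
    [1, -1, 0],
    [0, 0, 1],
]

rotated = apply_rotation_to_hkl((1, 2, 3), THREEFOLD)
print(rotated)
```

Applying the same rotation three times returns the original index, which is a quick sanity check for a threefold axis.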

assign_resolution_bins(bins=20, inplace=False, return_labels=True, format_str='.2f', return_edges=False)[source]

Assign reflections in DataSet to resolution bins.

Notes

  • If bin edges are provided, any reflections outside of the specified range are dropped.

Parameters:
  • bins (int, list, or np.ndarray) – Number of bins or bin edges to use when assigning resolution bins. If bin edges are provided, they must be monotonic (default: 20)

  • inplace (bool) – Whether to add the column in place or return a copy (default: False)

  • return_labels (bool) – Whether to return a list of labels corresponding to the edges of each resolution bin (default: True)

  • format_str (str) – Format string for constructing bin labels

  • return_edges (bool) – Whether to return bin edges that define the resolution bin boundaries. The bin edges are returned as a 1-dimensional array with bins + 1 entries (default: False)

Returns:

(DataSet, list), (DataSet, ndarray), (DataSet, list, ndarray) or DataSet
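
The note about explicit bin edges can be pictured with a small standalone sketch. It assigns d-spacings to bins defined by monotonic edges and marks anything outside the edge range for removal. This is not the library's implementation: the function name assign_bins, the bin numbering (bin 0 holds the smallest d-spacings), and the treatment of values that sit exactly on an edge are all assumptions.

```python
from bisect import bisect_right


def assign_bins(d_spacings, edges):
    """Map each d-spacing to a bin index; values outside the edges map to None."""
    ascending = sorted(edges)
    labels = []
    for d in d_spacings:
        if d < ascending[0] or d > ascending[-1]:
            labels.append(None)  # outside the specified range: would be dropped
            continue
        # Clamp so that a value equal to the largest edge lands in the last bin.
        labels.append(min(bisect_right(ascending, d) - 1, len(ascending) - 2))
    return labels


bins = assign_bins([12.0, 4.5, 2.2, 1.4], edges=[10.0, 5.0, 3.0, 2.0])
print(bins)
```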

canonicalize_phases(inplace=False)[source]

Canonicalize columns with phase data to fall in the interval between -180 and 180 degrees. This method will modify the values within any column composed of data with the PhaseDtype.

Parameters:

inplace (bool) – Whether to modify the DataSet in place or return a copy

Returns:

DataSet
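
The wrapping itself is simple modular arithmetic. The sketch below shows one way to do it in plain Python. It is not taken from the library, and the choice of the half-open interval [-180, 180) is an assumption, since the description above does not say which endpoint is included.

```python
def wrap_phase(phase_deg):
    """Wrap an angle in degrees into the half-open interval [-180, 180)."""
    return (phase_deg + 180.0) % 360.0 - 180.0


wrapped = [wrap_phase(p) for p in (190.0, -190.0, 360.0, 45.0)]
print(wrapped)
```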

property cell

Unit cell parameters (a, b, c, alpha, beta, gamma)

property centrics

Access centric reflections in DataSet

compute_dHKL(inplace=False)[source]

Compute the real space lattice plane spacing, d, associated with the HKL indices in the object.

Parameters:

inplace (bool) – Whether to add the column in place or return a copy
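
For reference, the lattice plane spacing can be computed from the cell parameters with the standard triclinic formula. The sketch below is a standalone calculation in plain Python, not the code used by the library. The function name d_spacing is hypothetical, and the (0, 0, 0) index is not handled.

```python
import math


def d_spacing(hkl, cell):
    """Return d (in the units of a, b, c) for a Miller index and a cell
    given as (a, b, c, alpha, beta, gamma) with angles in degrees."""
    h, k, l = hkl
    a, b, c, alpha, beta, gamma = cell
    ca, cb, cg = (math.cos(math.radians(x)) for x in (alpha, beta, gamma))
    # Squared cell volume.
    vol2 = (a * b * c) ** 2 * (1 - ca**2 - cb**2 - cg**2 + 2 * ca * cb * cg)
    inv_d2 = (
        (h * b * c) ** 2 * (1 - ca**2)
        + (k * a * c) ** 2 * (1 - cb**2)
        + (l * a * b) ** 2 * (1 - cg**2)
        + 2 * h * k * a * b * c**2 * (ca * cb - cg)
        + 2 * k * l * a**2 * b * c * (cb * cg - ca)
        + 2 * h * l * a * b**2 * c * (cg * ca - cb)
    ) / vol2
    return 1.0 / math.sqrt(inv_d2)


print(d_spacing((1, 1, 1), (10.0, 20.0, 30.0, 90.0, 90.0, 90.0)))
```

For an orthorhombic cell the cross terms vanish and the expression reduces to 1/d² = h²/a² + k²/b² + l²/c².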

compute_multiplicity(inplace=False, include_centering=True)[source]

Compute the multiplicity of reflections in DataSet. A new column of floats, “EPSILON”, is added to the object.

Parameters:
  • inplace (bool) – Whether to add the column in place or to return a copy

  • include_centering (bool) – Whether to include centering operations in the multiplicity calculation. The default is to include them.

expand_anomalous()[source]

Expands data by applying Friedel operator (-x, -y, -z). The necessary phase shifts are made for columns of complex dtypes or PhaseDtypes.

Returns:

DataSet
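
The idea behind the Friedel operator can be shown on a single reflection: the index is negated and, by Friedel's law, the phase changes sign. The sketch below is plain Python with a hypothetical helper friedel_mate. It only covers phases in degrees and does not reproduce how the library handles its dtypes.

```python
def friedel_mate(hkl, phase_deg):
    """Return the Friedel mate of a reflection and its phase in degrees."""
    h, k, l = hkl
    return (-h, -k, -l), -phase_deg


reflections = {(1, 2, 3): 30.0, (0, 4, 1): -120.0}

# Keep the original reflections and add their Friedel mates.
expanded = dict(reflections)
for hkl, phase in reflections.items():
    mate, mate_phase = friedel_mate(hkl, phase)
    expanded[mate] = mate_phase

print(expanded)
```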

expand_to_p1()[source]

Generates all symmetrically equivalent reflections. The spacegroup symmetry is set to P1.

Returns:

DataSet

find_twin_laws(max_obliq=1.0, all_ops=False)[source]

Find merohedral and pseudo-merohedral twin laws for cell and spacegroup of DataSet given an obliquity threshold (degrees).

Notes

  • With max_obliq=1e-6 and all_ops=False, this method returns the same operators as DataSet.reindexing_ops

  • For additional information, see the GEMMI symmetry page.

Parameters:
  • max_obliq (float) – Obliquity threshold (in degrees) as defined in Le Page, J Appl Cryst (1982). (default: 1.0)

  • all_ops (bool) – Whether to return all twin operators. If False, only non-redundant operators (coset representatives) are returned.

Returns:

List of gemmi.Op

classmethod from_gemmi(gemmiMtz)[source]

Creates DataSet object from gemmi.Mtz object.

If the gemmi.Mtz object contains an M/ISYM column and contains duplicated Miller indices, an unmerged DataSet will be constructed. The Miller indices will be mapped to their observed values, and a partiality flag will be extracted and stored as a boolean column with the label, PARTIAL. Otherwise, a merged DataSet will be constructed.

If columns are found with the MTZInt dtype and are labeled PARTIAL or CENTRIC, these will be interpreted as boolean flags used to label partial or centric reflections, respectively.

Parameters:

gemmiMtz (gemmi.Mtz)

Returns:

DataSet

from_structurefactor(sf_key)[source]

Convert complex structure factors to structure factor amplitudes and phases

Parameters:

sf_key (str) – Column label for complex structure factors

Returns:

(sf, phase) (tuple of DataSeries) – Tuple of DataSeries for the structure factor amplitudes and phases corresponding to the complex structure factors

See also

DataSet.to_structurefactor

Convert amplitude and phase to complex structure factor

get_complex_keys()[source]

Return column labels for data with complex dtype.

Returns:

keys (list of strings) – list of column labels with complex dtype

get_hkls()[source]

Get the Miller indices in the DataSet as a ndarray.

Returns:

hkl (ndarray, shape=(n_reflections, 3)) – Miller indices in DataSet

get_m_isym_keys()[source]

Return column labels for data with M/ISYM dtype.

Returns:

keys (list of strings) – list of column labels with M/ISYM dtype

get_phase_keys()[source]

Return column labels for data with Phase dtype.

Returns:

keys (list of strings) – list of column labels with Phase dtype

get_reciprocal_grid_size(dmin=None, sample_rate=3.0)[source]

Determine an appropriate 3D grid size for reflection data.

Returns the smallest grid size that yields a real-space grid spacing of at most dmin/sample_rate (in Å). The returned grid size will be ‘FFT-friendly’ (2, 3, or 5 are the largest prime factors), and will obey any symmetry constraints of the spacegroup.

Parameters:
  • dmin (float) – Highest-resolution reflection to consider for grid size

  • sample_rate (float) – Sets the minimal grid spacing relative to dmin. For example, sample_rate=3 corresponds to a real-space sampling of dmin/3. Value must be >= 1.0 (default: 3.0)

Returns:

list(int, int, int) – Grid size with desired spacing (list of 3 integers)
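
The grid-size rule can be sketched for a single axis: find the smallest number of grid points n such that cell_length / n is at most dmin / sample_rate, then increase n until its only prime factors are 2, 3, and 5. The sketch below is standalone Python, not the library's code. The function names are hypothetical, it assumes orthogonal cell axes, and it ignores the spacegroup symmetry constraints mentioned above.

```python
import math


def is_fft_friendly(n):
    """True if n has no prime factors other than 2, 3, and 5."""
    for p in (2, 3, 5):
        while n % p == 0:
            n //= p
    return n == 1


def min_grid_points(cell_length, dmin, sample_rate=3.0):
    """Smallest FFT-friendly n with cell_length / n <= dmin / sample_rate."""
    n = max(1, math.ceil(cell_length * sample_rate / dmin))
    while not is_fft_friendly(n):
        n += 1
    return n


grid = [min_grid_points(length, dmin=2.0) for length in (34.0, 45.0, 98.0)]
print(grid)
```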

hkl_to_asu(inplace=False, anomalous=False)[source]

Map HKL indices to the reciprocal space asymmetric unit. If phases are included in the DataSet, they will be changed according to the phase shift associated with the necessary symmetry operation.

If DataSet.merged == False, and a partiality flag labeled PARTIAL is included in the DataSet, the partiality flag will be used to construct a proper M/ISYM column. Both merged and unmerged DataSets will have an M/ISYM column added.

Parameters:
  • inplace (bool) – Whether to modify the DataSet in place or return a copy

  • anomalous (bool) – If True, acentric reflections will be mapped to the +/- ASU. If False, all reflections are mapped to the Friedel-plus ASU.

Returns:

DataSet

See also

DataSet.hkl_to_observed

Opposite of DataSet.hkl_to_asu()

hkl_to_observed(m_isym=None, inplace=False)[source]

Map HKL indices to their observed index using an M/ISYM column. This method applies the symmetry operation specified by the M/ISYM column to each Miller index in the DataSet. If phases are included in the DataSet, they will be changed by the phase shift associated with the symmetry operation.

If DataSet.merged == False, the M/ISYM column is used to construct a partiality flag labeled PARTIAL. This is added to the DataSet, and the M/ISYM column is dropped. If DataSet.merged == True, the M/ISYM column is dropped, but a partiality flag is not added.

Parameters:
  • m_isym (str) – Column label for M/ISYM values in DataSet. If m_isym is None and a single M/ISYM column is present, it will automatically be used.

  • inplace (bool) – Whether to modify the DataSet in place or return a copy

Returns:

DataSet

See also

DataSet.hkl_to_asu

Opposite of DataSet.hkl_to_observed()
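
To make the role of the M/ISYM column more concrete, the sketch below decodes a single value following the CCP4 MTZ convention, in which M/ISYM = 256 * M + ISYM, M is the partiality flag, and ISYM is odd for the Friedel-plus and even for the Friedel-minus copy of a symmetry operation. The helper name decode_m_isym and the zero-based operator index are assumptions for this example; the sketch does not show how the library applies the operator afterwards.

```python
def decode_m_isym(m_isym):
    """Split an M/ISYM value into (partial, symop_index, friedel_minus)."""
    m, isym = divmod(m_isym, 256)
    symop_index = (isym - 1) // 2  # zero-based index of the symmetry operation
    friedel_minus = isym % 2 == 0  # even ISYM marks the Friedel-minus copy
    return bool(m), symop_index, friedel_minus


print(decode_m_isym(1))
print(decode_m_isym(260))
```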

infer_mtz_dtypes(inplace=False, index=True)[source]

Infers MTZ dtypes from column names and underlying data. This method iterates over each column in the DataSet and tries to infer its proper MTZ dtype based on common MTZ naming conventions.

If a given column already has an MTZDtype, its type will be unchanged. If index is True, the MTZ dtypes will also be inferred for named columns in the index.

Parameters:
  • inplace (bool) – Whether to modify the dtypes in place or to return a copy

  • index (bool) – Infer MTZ dtypes for named column(s) in the DataSet index

Returns:

DataSet

See also

DataSeries.infer_mtz_dtype

Infer MTZ dtype for DataSeries

is_isomorphous(other, cell_threshold=0.5)[source]

Determine whether DataSet is isomorphous to another DataSet. This method confirms isomorphism by ensuring the spacegroups are equivalent, and that the cell parameters are within a specified percentage (see cell_threshold).

Parameters:
  • other (rs.DataSet) – DataSet to which it will be compared

  • cell_threshold (float) – Acceptable percent difference between unit cell parameters

Returns:

bool
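
The comparison described above can be sketched in a few lines of plain Python. This is an illustration rather than the library's implementation: the function names are hypothetical, spacegroups are compared as plain strings, and the percent difference is taken relative to the first cell, which is an assumption.

```python
def cells_within_threshold(cell, other_cell, cell_threshold=0.5):
    """True if every cell parameter differs by at most cell_threshold percent."""
    return all(
        100.0 * abs(p - q) / p <= cell_threshold
        for p, q in zip(cell, other_cell)
    )


def looks_isomorphous(sg, cell, other_sg, other_cell, cell_threshold=0.5):
    """Require identical spacegroup labels and similar cell parameters."""
    return sg == other_sg and cells_within_threshold(cell, other_cell, cell_threshold)


reference = (34.0, 45.0, 98.0, 90.0, 90.0, 90.0)
similar = (34.1, 45.1, 98.2, 90.0, 90.0, 90.0)
different = (35.0, 45.0, 98.0, 90.0, 90.0, 90.0)

print(looks_isomorphous("P 21 21 21", reference, "P 21 21 21", similar))
print(looks_isomorphous("P 21 21 21", reference, "P 21 21 21", different))
```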

join(*args, check_isomorphous=True, **kwargs)[source]

Join DataSets or named DataSeries using a database-style join on columns or indices. This method can be used to join lists of rs objects to a given DataSet.

For additional documentation on accepted arguments, see the Pandas DataFrame.join() API.

Parameters:

check_isomorphous (bool) – If True, the spacegroup and cell attributes of DataSets in other will be compared to those of the calling DataSet to ensure they are isomorphous.

Returns:

rs.DataSet

See also

DataSet.merge

Similar method with added flexibility for distinct column labels

label_absences(inplace=False)[source]

Label systematically absent reflections in DataSet. A new column of booleans, “ABSENT”, is added to the object.

Parameters:

inplace (bool) – Whether to add the column in place or to return a copy
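
A reflection is systematically absent when some symmetry operation maps its index onto itself while the translational part contributes a non-integer value of h·t. The sketch below checks that condition in plain Python for the two operations of P 21 (unique axis b), written out by hand. The helper names and the operator list are assumptions for this example, and the library's own implementation is not shown.

```python
from fractions import Fraction


def rotate_hkl(hkl, rot):
    """Multiply the row vector hkl by a 3x3 rotation matrix."""
    return tuple(sum(hkl[i] * rot[i][j] for i in range(3)) for j in range(3))


def is_absent(hkl, symops):
    """True if an operation fixes hkl while h.t is not an integer."""
    for rot, tran in symops:
        if rotate_hkl(hkl, rot) == tuple(hkl):
            shift = sum(h * t for h, t in zip(hkl, tran))
            if shift % 1 != 0:
                return True
    return False


ZERO, HALF = Fraction(0), Fraction(1, 2)
# Operations "x,y,z" and "-x,y+1/2,-z" of P 21, written out by hand.
P21_OPS = [
    ([[1, 0, 0], [0, 1, 0], [0, 0, 1]], (ZERO, ZERO, ZERO)),
    ([[-1, 0, 0], [0, 1, 0], [0, 0, -1]], (ZERO, HALF, ZERO)),
]

print(is_absent((0, 3, 0), P21_OPS))
```

With these operations only the 0k0 reflections with odd k are flagged, which matches the familiar screw-axis condition.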

label_centrics(inplace=False)[source]

Label centric reflections in DataSet. A new column of booleans, “CENTRIC”, is added to the object.

Parameters:

inplace (bool) – Whether to add the column in place or to return a copy
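
A reflection is centric when some symmetry operation of the spacegroup maps its index onto its Friedel mate. The sketch below tests that condition in plain Python using the rotation parts of the two operations of P 2 (unique axis b), written out by hand. The helper name is_centric and the matrices are assumptions for this example, not the library's implementation.

```python
def is_centric(hkl, rotations):
    """True if any rotation maps the row vector hkl onto (-h, -k, -l)."""
    negated = tuple(-x for x in hkl)
    for rot in rotations:
        image = tuple(sum(hkl[i] * rot[i][j] for i in range(3)) for j in range(3))
        if image == negated:
            return True
    return False


# Rotation parts of "x,y,z" and "-x,y,-z", written out by hand.
P2_ROTATIONS = [
    [[1, 0, 0], [0, 1, 0], [0, 0, 1]],
    [[-1, 0, 0], [0, 1, 0], [0, 0, -1]],
]

print(is_centric((1, 0, 3), P2_ROTATIONS))
print(is_centric((1, 2, 3), P2_ROTATIONS))
```

In this setting the h0l reflections are the centric ones, because the twofold axis sends (h, 0, l) to (-h, 0, -l).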

merge(*args, check_isomorphous=True, **kwargs)[source]

Merge DataSet or named DataSeries using a database-style join on columns or indices.

For additional documentation on accepted arguments, see the Pandas DataFrame.merge() API.

Parameters:

check_isomorphous (bool) – If True, the spacegroup and cell attributes of DataSets in other will be compared to those of the calling DataSet to ensure they are isomorphous.

Returns:

rs.DataSet

See also

DataSet.join

Similar method with support for lists of rs objects

property merged

Whether DataSet contains merged reflection data (boolean)

property reindexing_ops

Possible reindexing operations (merohedral twin laws) for DataSet

remove_absences(inplace=False)[source]

Remove systematically absent reflections in DataSet.

Parameters:

inplace (bool) – Whether to remove the reflections in place or to return a copy

Returns:

DataSet

reset_index(level=None, drop=False, inplace=False, col_level=0, col_fill='', allow_duplicates=<no_default>, names=None)[source]

Reset the index or a specific level of a MultiIndex.

Reset the index to use a numbered RangeIndex. Using the level argument, it is possible to reset one or more levels of a MultiIndex.

Parameters:
  • level (int, str, tuple, list) – Only remove given levels from the index. Defaults to all levels

  • drop (bool) – Do not try to insert index into dataframe columns.

  • inplace (bool) – Modify the DataSet in place (do not create a new object).

  • col_level (int or str) – If the columns have multiple levels, determines which level the labels are inserted into. By default it is inserted into the first level.

  • col_fill (object) – If the columns have multiple levels, determines how the other levels are named. If None then the index name is repeated.

  • allow_duplicates (bool) – Allow duplicate column labels to be created.

  • names (int, str, tuple, list) – Using the given string, rename the DataSet column which contains the index data. If the DataSet has a MultiIndex, this has to be a list or tuple with length equal to the number of levels.

Returns:

DataSet or None – DataSet with the new index or None if inplace=True

See also

DataSet.set_index

Set index

select_mtzdtype(dtype)[source]

Return subset of DataSet’s columns that are of the given dtype.

Parameters:

dtype (str or instance of MTZDtype) – Single-letter MTZ code, name, or MTZDtype instance to return

Returns:

DataSet – Subset of the DataSet with columns matching the requested dtype. If no columns of the requested dtype are found, an empty DataSet is returned.

Raises:

ValueError – If dtype is neither a string nor an MTZDtype instance

set_index(keys, drop=True, append=False, inplace=False, verify_integrity=False)[source]

Set the DataSet index using existing columns.

Set the DataSet index (row labels) using one or more existing columns or arrays (of the correct length). The index can replace the existing index or expand on it.

Parameters:
  • keys (label or array-like or list of labels/arrays) – This parameter can be either a single column key, a single array of the same length as the calling DataSet, or a list containing an arbitrary combination of column keys and arrays.

  • drop (bool) – Whether to delete columns to be used as the new index.

  • append (bool) – Whether to append columns to existing index.

  • inplace (bool) – Modify the DataSet in place (do not create a new object).

  • verify_integrity (bool) – Check the new index for duplicates. Otherwise defer the check until necessary. Setting to False will improve the performance of this method

Returns:

DataSet or None – DataSet with the new index or None if inplace=True

See also

DataSet.reset_index

Reset index

property spacegroup

Crystallographic space group

stack_anomalous(plus_labels=None, minus_labels=None, suffixes=('(+)', '(-)'))[source]

Convert data from two-column anomalous format to one-column format. Intensities, structure factor amplitudes, or other data are converted from separate columns corresponding to a single Miller index to the same data column at different rows indexed by the Friedel-plus or Friedel-minus Miller index.

This method will return a DataSet with, at most, twice as many rows as the original – one row for each Friedel pair. In most cases, the resulting DataSet will be smaller, because centric reflections will not be stacked. For a merged DataSet, this has the effect of mapping reflections from the positive reciprocal space ASU to the positive and negative reciprocal space ASU, for Friedel-plus and Friedel-minus reflections, respectively.

Notes

  • A ValueError is raised if invoked with an unmerged DataSet

  • It is assumed that Friedel-plus column labels are suffixed with (+), and that Friedel-minus column labels are suffixed with (-)

  • A ValueError is raised if stripping suffixes will lead to a duplicate column name

  • Corresponding column labels are expected to be given in the same order

Parameters:
  • plus_labels (str or list-like) – Column label or list of column labels of data associated with Friedel-plus reflections

  • minus_labels (str or list-like) – Column label or list of column labels of data associated with Friedel-minus reflections

  • suffixes (list of strings) – Suffixes to identify column labels associated with Friedel-plus and Friedel-minus reflections. Only consulted if plus_labels and minus_labels are None. Defaults to (“(+)”, “(-)”)

Returns:

DataSet

See also

DataSet.unstack_anomalous

Opposite of stack_anomalous
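
The change of layout can be pictured with ordinary dictionaries. In the sketch below each input row holds an "I(+)" and an "I(-)" value at one Miller index, and the output holds a single "I" value at the Friedel-plus and Friedel-minus indices. The function name stack_two_column and the column labels are hypothetical, and the sketch treats every reflection as acentric, so it does not reproduce the special handling of centric reflections described above.

```python
def stack_two_column(rows, plus_label="I(+)", minus_label="I(-)", label="I"):
    """Convert {hkl: {plus, minus}} rows into one row per Friedel mate."""
    stacked = {}
    for (h, k, l), values in rows.items():
        stacked[(h, k, l)] = {label: values[plus_label]}
        stacked[(-h, -k, -l)] = {label: values[minus_label]}
    return stacked


two_column = {
    (1, 2, 3): {"I(+)": 10.0, "I(-)": 12.0},
    (2, 0, 1): {"I(+)": 55.0, "I(-)": 51.5},
}
one_column = stack_two_column(two_column)
print(one_column)
```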

to_gemmi(skip_problem_mtztypes=False, project_name='reciprocalspaceship', crystal_name='reciprocalspaceship', dataset_name='reciprocalspaceship')[source]

Creates gemmi.Mtz object from DataSet object.

If DataSet.merged == False, the reflections will be mapped to the reciprocal space ASU, and an M/ISYM column will be constructed.

If boolean flags with the label PARTIAL or CENTRIC are found in the DataSet, these will be cast to the MTZInt dtype, and included in the gemmi.Mtz object.

Parameters:
  • skip_problem_mtztypes (bool) – Whether to skip columns in DataSet that do not have specified MTZ datatypes

  • project_name (str) – Project name to assign to MTZ file

  • crystal_name (str) – Crystal name to assign to MTZ file

  • dataset_name (str) – Dataset name to assign to MTZ file

Returns:

gemmi.Mtz

to_numpy(dtype=None, copy=False, na_value=<no_default>)[source]

Convert the DataSet to a NumPy array.

This method will attempt to infer a consensus numpy dtype from the dtypes of the DataSet columns. If the DataSet is composed of all int32-backed MTZ dtypes and does not contain NaN values, the returned dtype will be int32. For all other combinations of MTZDtype, the returned dtype will be float32. If the DataSet contains dtypes other than MTZDtype, the default Pandas behavior is used (see Pandas documentation).

Parameters:
  • dtype (str or np.dtype) – The dtype to pass to np.asarray()

  • copy (bool) – Whether to ensure that the returned value is not a view on another array. Note that copy=False does not ensure that to_numpy() is no-copy. Rather, copy=True ensures that a copy is made, even if not strictly necessary. (default: False)

  • na_value (Any) – The value to use for missing values. The default value depends on dtype and the dtypes of the DataSet columns.

Returns:

np.ndarray

to_pickle(path, *args, **kwargs)[source]

Pickle object to file.

This can be useful for saving non-MTZ compatible data files for future use. For additional documentation on accepted arguments, see the Pandas DataFrame.to_pickle() API.

Parameters:

path (str) – File path where the pickled object will be stored.

See also

read_pickle

to_reciprocal_grid(key, sample_rate=3.0, dmin=None, grid_size=None)[source]

Set up reciprocal grid with values from column, key, indexed by Miller indices.

Notes

  • The data being arranged on a reciprocal grid must be compatible with a numpy datatype.

  • Any missing Miller indices are initialized to zero.

  • If explicitly provided, grid_size supersedes dmin and sample_rate for grid size determination.

  • The grid size determined using sample_rate and dmin will depend on the cell parameters of the dataset. If the grid size must be consistent across different isomorphous cell parameters, grid_size can be explicitly provided.

Parameters:
  • key (str) – Column label for value to arrange on reciprocal grid

  • sample_rate (float) – Sets the minimal grid spacing relative to dmin. For example, sample_rate=3 corresponds to a real-space sampling of dmin/3. (default: 3.0)

  • dmin (float) – Highest-resolution reflection to consider for grid size. If None, dmin will be set to the highest resolution reflection in the dataset. The reflections used to populate the grid will also be truncated to dHKL >= dmin (default: None)

  • grid_size (array-like (len==3)) – If given, provides the explicit dimensions for 3D reciprocal grid. If None, grid size will be set based on sample_rate and dmin. If provided, this grid size will be used regardless of the values provided as sample_rate and dmin

Returns:

numpy.ndarray
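
The layout of such a grid can be sketched without any dependencies. In the sketch below values are placed at grid[h][k][l], negative indices wrap around the grid dimensions, and every position without a reflection stays at zero. The function name to_grid is hypothetical, nested lists stand in for the returned array, and the wrap-around placement of negative indices is an assumption about the layout rather than something stated above.

```python
def to_grid(values, grid_size):
    """Place {hkl: value} on a zero-initialized 3D grid of nested lists."""
    n0, n1, n2 = grid_size
    grid = [[[0.0] * n2 for _ in range(n1)] for _ in range(n0)]
    for (h, k, l), value in values.items():
        # Negative Miller indices wrap around, as with modular indexing.
        grid[h % n0][k % n1][l % n2] = value
    return grid


grid = to_grid({(1, 0, 0): 5.0, (-1, 2, -3): 7.5}, grid_size=(4, 4, 8))
print(grid[1][0][0], grid[3][2][5])
```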

to_reciprocalgrid(key, sample_rate=3.0, dmin=None, gridsize=None)[source]

Deprecated: Set up reciprocal grid with values from column, key, indexed by Miller indices.

Warning

This function is deprecated. Use to_reciprocal_grid() instead.

Notes

  • The data being arranged on a reciprocal grid must be compatible with a numpy datatype.

  • Any missing Miller indices are initialized to zero.

  • If explicitly provided, gridsize supersedes dmin and sample_rate for grid size determination.

  • The grid size determined using sample_rate and dmin will depend on the cell parameters of the dataset. If the grid size must be consistent across different isomorphous cell parameters, gridsize can be explicitly provided.

Parameters:
  • key (str) – Column label for value to arrange on reciprocal grid

  • sample_rate (float) – Sets the minimal grid spacing relative to dmin. For example, sample_rate=3 corresponds to a real-space sampling of dmin/3. (default: 3.0)

  • dmin (float) – Highest-resolution reflection to consider for grid size. If None, dmin will be set to the highest resolution reflection in the dataset. The reflections used to populate the grid will also be truncated to dHKL >= dmin (default: None)

  • gridsize (array-like (len==3)) – If given, provides the explicit dimensions for 3D reciprocal grid. If None, grid size will be set based on sample_rate and dmin. If provided, this grid size will be used regardless of the values provided as sample_rate and dmin

Returns:

numpy.ndarray

to_structurefactor(sf_key, phase_key)[source]

Convert structure factor amplitudes and phases to complex structure factors

Parameters:
  • sf_key (str) – Column label for structure factor amplitudes

  • phase_key (str) – Column label for phases

Returns:

rs.DataSeries – Complex structure factors

See also

DataSet.from_structurefactor

Convert complex structure factor to amplitude and phase
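
The underlying conversion is the usual polar-to-rectangular relationship, F = |F| exp(iφ). The sketch below shows the round trip for a single value using only the standard library. The function names to_complex and from_complex are hypothetical, and phases are assumed to be in degrees.

```python
import cmath
import math


def to_complex(amplitude, phase_deg):
    """Combine an amplitude and a phase in degrees into a complex number."""
    return cmath.rect(amplitude, math.radians(phase_deg))


def from_complex(sf):
    """Split a complex structure factor into (amplitude, phase in degrees)."""
    return abs(sf), math.degrees(cmath.phase(sf))


sf = to_complex(10.0, 90.0)
amplitude, phase = from_complex(sf)
print(sf, amplitude, phase)
```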

unstack_anomalous(columns=None, suffixes=('(+)', '(-)'))[source]

Convert data from one-column format to two-column anomalous format. Provided column labels are converted from separate rows indexed by their Friedel-plus or Friedel-minus Miller index to different columns indexed at the Friedel-plus HKL.

This method will return a smaller DataSet than the original – Friedel pairs will both be indexed at the Friedel-plus index. This has the effect of mapping reflections to the positive reciprocal space ASU, including data for both Friedel pairs at the Friedel-plus Miller index.

Notes

  • A ValueError is raised if invoked with an unmerged DataSet

Parameters:
  • columns (str or list-like) – Column label or list of column labels of data that should be associated with Friedel pairs. If None, all columns are converted to the two-column anomalous format.

  • suffixes (tuple or list of str) – Suffixes to append to Friedel-plus and Friedel-minus data columns

Returns:

DataSet

See also

DataSet.stack_anomalous

Opposite of unstack_anomalous

write_mtz(mtzfile, skip_problem_mtztypes=False, project_name='reciprocalspaceship', crystal_name='reciprocalspaceship', dataset_name='reciprocalspaceship')[source]

Write DataSet to MTZ file.

If DataSet.merged == False, the reflections will be mapped to the reciprocal space ASU, and an M/ISYM column will be constructed.

If boolean flags with the label PARTIAL or CENTRIC are found in the DataSet, these will be cast to the MTZInt dtype, and included in the output MTZ file.

Parameters:
  • mtzfile (str or file) – Name of an MTZ file or a file object

  • skip_problem_mtztypes (bool) – Whether to skip columns in DataSet that do not have specified MTZ datatypes

  • project_name (str) – Project name to assign to MTZ file

  • crystal_name (str) – Crystal name to assign to MTZ file

  • dataset_name (str) – Dataset name to assign to MTZ file