{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# MTZ Data Types \n", "\n", "MTZ files use column types to specify what type of crystallographic data is contained within a given column (see [MTZ specification](https://www.ccp4.ac.uk/html/mtzformat.html#column-types)). This enables columns to have arbitrary names while ensuring that the column values are interpreted correctly. \n", "\n", "In order to ensure that MTZ data types behave as expected in ``rs.DataSet`` objects, we have implemented a set of custom ``pandas`` dtypes to represent the crystallographic data found in MTZ files. This facilitates MTZ file I/O, and makes it possible to write methods that operate only on expected types of crystallographic data. " ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import reciprocalspaceship as rs\n", "import numpy as np\n", "from IPython.display import HTML" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "---\n", "### Supported MTZ data types\n", "\n", "The following MTZ dtypes are available for `rs.DataSet` and `rs.DataSeries` objects:" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
MTZ CodeNameClassInternal
DAnomalousDifferenceAnomalousDifferenceDtypefloat32
BBatchBatchDtypeint32
KFriedelIntensityFriedelIntensityDtypefloat32
GFriedelSFAmplitudeFriedelStructureFactorAmplitudeDtypefloat32
HHKLHKLIndexDtypeint32
AHendricksonLattmanHendricksonLattmanDtypefloat32
JIntensityIntensityDtypefloat32
IMTZIntMTZIntDtypeint32
RMTZRealMTZRealDtypefloat32
YM/ISYMM_IsymDtypeint32
ENormalizedSFAmplitudeNormalizedStructureFactorAmplitudeDtypefloat32
PPhasePhaseDtypefloat32
QStddevStandardDeviationDtypefloat32
MStddevFriedelIStandardDeviationFriedelIDtypefloat32
LStddevFriedelSFStandardDeviationFriedelSFDtypefloat32
FSFAmplitudeStructureFactorAmplitudeDtypefloat32
WWeightWeightDtypefloat32
" ], "text/plain": [ "" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df = rs.summarize_mtz_dtypes(print_summary=False)\n", "HTML(df.to_html(index=False))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Internally, these are all stored as `numpy` arrays of 32-bit ints or floats. This is because MTZ files only take 32-bit values. It is worth keeping in mind that other data types can be stored in an `rs.DataSet` column or `rs.DataSeries`; however, only MTZ dtypes can be written out to an MTZ file. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "---\n", "### Specifying MTZ data types\n", "\n", "It is possible to specify a dtype using the MTZ Code, Name, or Class from the above table:" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 0.0\n", "1 1.0\n", "2 2.0\n", "dtype: Intensity" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data1 = rs.DataSeries([0, 1, 2], dtype=\"J\")\n", "data1" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 0.0\n", "1 1.0\n", "2 2.0\n", "dtype: Intensity" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data2 = rs.DataSeries([0, 1, 2], dtype=\"Intensity\")\n", "data2" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 0.0\n", "1 1.0\n", "2 2.0\n", "dtype: Intensity" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data3 = rs.DataSeries([0, 1, 2], dtype=rs.IntensityDtype())\n", "data3" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If you already have an `rs.DataSeries`, it is possible to change it to an MTZ dtype:" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 0.0\n", "1 1.0\n", "2 2.0\n", "dtype: Intensity" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data4 = rs.DataSeries([0, 1, 2], dtype=np.int64)\n", "data4.astype(\"Intensity\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In the example above, the `np.int64` array was converted into an array of `float32` values because that is that is the internal storage type for the `rs.IntensityDtype`." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "---\n", "### Inferring MTZ data types\n", "\n", "If data is read directly from a MTZ file, the proper MTZ data types will be set automatically. However, in order to facilitate working with MTZ dtypes, there is also support for inferring proper dtypes based on the underlying data and name of a `rs.DataSeries` or the columns of an `rs.DataSet`. Inferring the proper dtype is not always possible, but these functions are written to work for most common column names in MTZ files. If you come across common cases that do not seem to be supported, please feel free to file an [issue on GitHub](https://github.com/rs-station/reciprocalspaceship/issues).\n", "\n", "Inferring dtype for `DataSeries`:" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "0 0\n", "1 1\n", "2 2\n", "Name: SigI, dtype: int64\n" ] } ], "source": [ "data = rs.DataSeries([0, 1, 2], name=\"SigI\")\n", "print(data)" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 0.0\n", "1 1.0\n", "2 2.0\n", "Name: SigI, dtype: Stddev" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data.infer_mtz_dtype()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "It is also possible to infer the dtype for all of the columns in a `DataSet`. To illustrate this, we will read in an MTZ file, set all of the columns to the `object` dtype, and infer the correct dtypes: " ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "FreeR_flag MTZInt\n", "IMEAN Intensity\n", "SIGIMEAN Stddev\n", "I(+) FriedelIntensity\n", "SIGI(+) StddevFriedelI\n", "I(-) FriedelIntensity\n", "SIGI(-) StddevFriedelI\n", "N(+) MTZInt\n", "N(-) MTZInt\n", "dtype: object" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "dataset = rs.read_mtz(\"../examples/data/HEWL_SSAD_24IDC.mtz\")\n", "dataset.dtypes" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "FreeR_flag object\n", "IMEAN object\n", "SIGIMEAN object\n", "I(+) object\n", "SIGI(+) object\n", "I(-) object\n", "SIGI(-) object\n", "N(+) object\n", "N(-) object\n", "dtype: object" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "dataset = dataset.astype(object)\n", "dataset.dtypes" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "FreeR_flag MTZInt\n", "IMEAN Intensity\n", "SIGIMEAN Stddev\n", "I(+) FriedelIntensity\n", "SIGI(+) StddevFriedelI\n", "I(-) FriedelIntensity\n", "SIGI(-) StddevFriedelI\n", "N(+) MTZInt\n", "N(-) MTZInt\n", "dtype: object" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "dataset.infer_mtz_dtypes(inplace=True)\n", "dataset.dtypes" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "---\n", "### Switching between Friedel and non-Friedel data types\n", "\n", "Several MTZ data types are specific for anomalous data pertaining to Friedel pairs. For applicable `rs.DataSeries` objects, it is possible to switch between these data types. For data types without a Friedel-equivalent, the `rs.DataSeries` is returned unchanged:" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 0.0\n", "1 1.0\n", "2 2.0\n", "dtype: FriedelIntensity" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Has Friedel-equivalent\n", "rs.DataSeries([0, 1, 2], dtype=\"Intensity\").to_friedel_dtype()" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 0.0\n", "1 1.0\n", "2 2.0\n", "dtype: Intensity" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Has Friedel-equivalent\n", "rs.DataSeries([0, 1, 2], dtype=\"FriedelIntensity\").from_friedel_dtype()" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 0\n", "1 1\n", "2 2\n", "dtype: MTZInt" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# No Friedel-equivalent\n", "rs.DataSeries([0, 1, 2], dtype=\"MTZInt\").to_friedel_dtype()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "---\n", "### Writing out MTZ files\n", "\n", "Any data that will be written out to a MTZ format file must have an MTZ data type. " ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "FreeR_flag MTZInt\n", "IMEAN Intensity\n", "SIGIMEAN Stddev\n", "I(+) FriedelIntensity\n", "SIGI(+) StddevFriedelI\n", "I(-) FriedelIntensity\n", "SIGI(-) StddevFriedelI\n", "N(+) MTZInt\n", "N(-) MTZInt\n", "dtype: object" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "dataset.dtypes" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [], "source": [ "dataset.write_mtz(\"temp.mtz\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If there is a non-MTZ dtype in a `DataSet`, `DataSet.write_mtz()` will raise a `ValueError`." ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "FreeR_flag MTZInt\n", "IMEAN Intensity\n", "SIGIMEAN Stddev\n", "I(+) FriedelIntensity\n", "SIGI(+) StddevFriedelI\n", "I(-) FriedelIntensity\n", "SIGI(-) StddevFriedelI\n", "N(+) MTZInt\n", "N(-) MTZInt\n", "Temp object\n", "dtype: object" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "dataset[\"Temp\"] = \"string\"\n", "dataset.dtypes" ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [ { "ename": "ValueError", "evalue": "column Temp of type object cannot be written to an MTZ file. To skip columns without explicit MTZ dtypes, set skip_problem_mtztypes=True", "output_type": "error", "traceback": [ "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m", "\u001b[0;31mValueError\u001b[0m Traceback (most recent call last)", "Input \u001b[0;32mIn [18]\u001b[0m, in \u001b[0;36m\u001b[0;34m()\u001b[0m\n\u001b[0;32m----> 1\u001b[0m \u001b[43mdataset\u001b[49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43mwrite_mtz\u001b[49m\u001b[43m(\u001b[49m\u001b[38;5;124;43m\"\u001b[39;49m\u001b[38;5;124;43mtemp.mtz\u001b[39;49m\u001b[38;5;124;43m\"\u001b[39;49m\u001b[43m)\u001b[49m\n", "File \u001b[0;32m~/Documents/github/reciprocalspaceship/reciprocalspaceship/dataset.py:611\u001b[0m, in \u001b[0;36mDataSet.write_mtz\u001b[0;34m(self, mtzfile, skip_problem_mtztypes, project_name, crystal_name, dataset_name)\u001b[0m\n\u001b[1;32m 585\u001b[0m \u001b[38;5;124;03m\"\"\"\u001b[39;00m\n\u001b[1;32m 586\u001b[0m \u001b[38;5;124;03mWrite DataSet to MTZ file.\u001b[39;00m\n\u001b[1;32m 587\u001b[0m \n\u001b[0;32m (...)\u001b[0m\n\u001b[1;32m 607\u001b[0m \u001b[38;5;124;03m Dataset name to assign to MTZ file\u001b[39;00m\n\u001b[1;32m 608\u001b[0m \u001b[38;5;124;03m\"\"\"\u001b[39;00m\n\u001b[1;32m 609\u001b[0m \u001b[38;5;28;01mfrom\u001b[39;00m \u001b[38;5;21;01mreciprocalspaceship\u001b[39;00m \u001b[38;5;28;01mimport\u001b[39;00m io\n\u001b[0;32m--> 611\u001b[0m \u001b[38;5;28;01mreturn\u001b[39;00m \u001b[43mio\u001b[49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43mwrite_mtz\u001b[49m\u001b[43m(\u001b[49m\n\u001b[1;32m 612\u001b[0m \u001b[43m \u001b[49m\u001b[38;5;28;43mself\u001b[39;49m\u001b[43m,\u001b[49m\n\u001b[1;32m 613\u001b[0m \u001b[43m \u001b[49m\u001b[43mmtzfile\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 614\u001b[0m \u001b[43m \u001b[49m\u001b[43mskip_problem_mtztypes\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 615\u001b[0m \u001b[43m \u001b[49m\u001b[43mproject_name\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 616\u001b[0m \u001b[43m \u001b[49m\u001b[43mcrystal_name\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 617\u001b[0m \u001b[43m \u001b[49m\u001b[43mdataset_name\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 618\u001b[0m \u001b[43m\u001b[49m\u001b[43m)\u001b[49m\n", "File \u001b[0;32m~/Documents/github/reciprocalspaceship/reciprocalspaceship/io/mtz.py:225\u001b[0m, in \u001b[0;36mwrite_mtz\u001b[0;34m(dataset, mtzfile, skip_problem_mtztypes, project_name, crystal_name, dataset_name)\u001b[0m\n\u001b[1;32m 191\u001b[0m \u001b[38;5;28;01mdef\u001b[39;00m \u001b[38;5;21mwrite_mtz\u001b[39m(\n\u001b[1;32m 192\u001b[0m dataset,\n\u001b[1;32m 193\u001b[0m mtzfile,\n\u001b[0;32m (...)\u001b[0m\n\u001b[1;32m 197\u001b[0m dataset_name,\n\u001b[1;32m 198\u001b[0m ):\n\u001b[1;32m 199\u001b[0m \u001b[38;5;124;03m\"\"\"\u001b[39;00m\n\u001b[1;32m 200\u001b[0m \u001b[38;5;124;03m Write an MTZ reflection file from the reflection data in a DataSet.\u001b[39;00m\n\u001b[1;32m 201\u001b[0m \n\u001b[0;32m (...)\u001b[0m\n\u001b[1;32m 223\u001b[0m \u001b[38;5;124;03m Dataset name to assign to MTZ file\u001b[39;00m\n\u001b[1;32m 224\u001b[0m \u001b[38;5;124;03m \"\"\"\u001b[39;00m\n\u001b[0;32m--> 225\u001b[0m mtz \u001b[38;5;241m=\u001b[39m \u001b[43mto_gemmi\u001b[49m\u001b[43m(\u001b[49m\n\u001b[1;32m 226\u001b[0m \u001b[43m \u001b[49m\u001b[43mdataset\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43mskip_problem_mtztypes\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43mproject_name\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43mcrystal_name\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43mdataset_name\u001b[49m\n\u001b[1;32m 227\u001b[0m \u001b[43m \u001b[49m\u001b[43m)\u001b[49m\n\u001b[1;32m 228\u001b[0m mtz\u001b[38;5;241m.\u001b[39mwrite_to_file(mtzfile)\n\u001b[1;32m 229\u001b[0m \u001b[38;5;28;01mreturn\u001b[39;00m\n", "File \u001b[0;32m~/Documents/github/reciprocalspaceship/reciprocalspaceship/io/mtz.py:151\u001b[0m, in \u001b[0;36mto_gemmi\u001b[0;34m(dataset, skip_problem_mtztypes, project_name, crystal_name, dataset_name)\u001b[0m\n\u001b[1;32m 149\u001b[0m \u001b[38;5;28;01mcontinue\u001b[39;00m\n\u001b[1;32m 150\u001b[0m \u001b[38;5;28;01melse\u001b[39;00m:\n\u001b[0;32m--> 151\u001b[0m \u001b[38;5;28;01mraise\u001b[39;00m \u001b[38;5;167;01mValueError\u001b[39;00m(\n\u001b[1;32m 152\u001b[0m \u001b[38;5;124mf\u001b[39m\u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mcolumn \u001b[39m\u001b[38;5;132;01m{\u001b[39;00mc\u001b[38;5;132;01m}\u001b[39;00m\u001b[38;5;124m of type \u001b[39m\u001b[38;5;132;01m{\u001b[39;00mcseries\u001b[38;5;241m.\u001b[39mdtype\u001b[38;5;132;01m}\u001b[39;00m\u001b[38;5;124m cannot be written to an MTZ file. \u001b[39m\u001b[38;5;124m\"\u001b[39m\n\u001b[1;32m 153\u001b[0m \u001b[38;5;124mf\u001b[39m\u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mTo skip columns without explicit MTZ dtypes, set skip_problem_mtztypes=True\u001b[39m\u001b[38;5;124m\"\u001b[39m\n\u001b[1;32m 154\u001b[0m )\n\u001b[1;32m 155\u001b[0m mtz\u001b[38;5;241m.\u001b[39mset_data(temp[columns]\u001b[38;5;241m.\u001b[39mto_numpy(dtype\u001b[38;5;241m=\u001b[39m\u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mfloat32\u001b[39m\u001b[38;5;124m\"\u001b[39m))\n\u001b[1;32m 157\u001b[0m \u001b[38;5;66;03m# Handle Unmerged data\u001b[39;00m\n", "\u001b[0;31mValueError\u001b[0m: column Temp of type object cannot be written to an MTZ file. To skip columns without explicit MTZ dtypes, set skip_problem_mtztypes=True" ] } ], "source": [ "dataset.write_mtz(\"temp.mtz\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As the error message states, it is still possible to write out the MTZ by setting `skip_problem_mtztypes=True`. This will skip any columns with non-MTZ data types:" ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [], "source": [ "dataset.write_mtz(\"temp.mtz\", skip_problem_mtztypes=True)" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.10.6" } }, "nbformat": 4, "nbformat_minor": 4 }