matminer.featurizers package

Submodules

matminer.featurizers.bandstructure module

class matminer.featurizers.bandstructure.BandFeaturizer(kpoints=None, find_method='nearest', nbands=2)

Bases: matminer.featurizers.base.BaseFeaturizer

Featurizes a pymatgen band structure object.

Args:
kpoints ([1x3 numpy array]): list of fractional coordinates of
k-points at which energy is extracted.
find_method (str): the method for finding or interpolating for energy
at given kpoints. It does nothing if kpoints is None. Options are:

‘nearest’: the energy of the nearest available k-point to
the input k-point is returned.

‘linear’: the result of linear interpolation is returned; see the documentation for scipy.interpolate.griddata.

nbands (int): the number of valence/conduction bands to be featurized
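The ‘nearest’ option can be illustrated with a small standalone sketch. This is not the featurizer's own code: the helper name `nearest_kpoint_energy` and the sample energies are hypothetical, and periodic images of the k-points are ignored for simplicity.

```python
import math


def nearest_kpoint_energy(kpoints, energies, target):
    """Return the energy stored at the k-point closest to `target`.

    kpoints: list of fractional coordinates, e.g. [[0, 0, 0], [0.5, 0, 0]]
    energies: one band energy (eV) per entry in `kpoints`
    target: fractional coordinate at which an energy is requested
    """
    # Plain Euclidean distance in fractional coordinates (assumption:
    # periodic images are not considered in this sketch).
    distances = [math.dist(k, target) for k in kpoints]
    nearest = distances.index(min(distances))
    return energies[nearest]


# Hypothetical conduction-band energies sampled on three k-points
kpts = [[0.0, 0.0, 0.0], [0.5, 0.0, 0.0], [0.5, 0.5, 0.0]]
cb_energies = [1.10, 1.45, 1.80]
energy_at_target = nearest_kpoint_energy(kpts, cb_energies, [0.4, 0.1, 0.0])
```

Here the requested point lies closest to [0.5, 0.0, 0.0], so the energy stored at that k-point is returned unchanged; ‘linear’ would instead interpolate between the surrounding points.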

__init__(kpoints=None, find_method='nearest', nbands=2)
citations()
feature_labels()
featurize(bs)
Args:
bs (pymatgen BandStructure or BandStructureSymmLine or their dict):
The band structure to featurize. To obtain all features, bs should include the structure attribute.
Returns:
([float]): a list of band structure features. If not bs.structure,
features that require the structure will be returned as NaN.
List of currently supported features:

band_gap (eV): the difference between the CBM and VBM energy
is_gap_direct (0.0|1.0): whether the band gap is direct or not
direct_gap (eV): the minimum direct distance of the last
valence band and the first conduction band
p_ex1_norm (float): k-space distance between Gamma point
and k-point of VBM
n_ex1_norm (float): k-space distance between Gamma point
and k-point of CBM
p_ex1_degen: degeneracy of VBM
n_ex1_degen: degeneracy of CBM

if kpoints is provided (e.g. for kpoints == [[0.0, 0.0, 0.0]]):

n_0.0;0.0;0.0_en: (energy of the first conduction band at
[0.0, 0.0, 0.0] - CBM energy)
p_0.0;0.0;0.0_en: (energy of the last valence band at
[0.0, 0.0, 0.0] - VBM energy)
static get_bindex_bspin(extremum, is_cbm)

Returns the band index and spin of band extremum

Args:
extremum (dict): dictionary containing the CBM/VBM, i.e. output of
Bandstructure.get_cbm()

is_cbm (bool): whether the extremum is the CBM or not

implementors()
class matminer.featurizers.bandstructure.BranchPointEnergy(n_vb=1, n_cb=1, calculate_band_edges=True)

Bases: matminer.featurizers.base.BaseFeaturizer

__init__(n_vb=1, n_cb=1, calculate_band_edges=True)

Calculates the branch point energy and (optionally) an absolute band edge position, assuming the branch point energy is the center of the gap.

Args:
n_vb: (int) number of valence bands to include in BPE calc
n_cb: (int) number of conduction bands to include in BPE calc
calculate_band_edges: (bool) whether to also return band edge
positions
citations()
feature_labels()
featurize(bs, target_gap=None)
Args:
bs: (BandStructure) Uniform (not symm line) band structure
Returns:
(float) branch point energy on same energy scale as BS eigenvalues
implementors()

matminer.featurizers.base module

class matminer.featurizers.base.BaseFeaturizer

Bases: sklearn.base.BaseEstimator, sklearn.base.TransformerMixin

Abstract class to calculate features from raw materials input data such as a compound formula, a pymatgen crystal structure, or a bandstructure object.

## Using a BaseFeaturizer Class

There are multiple ways for running the featurize routines:

featurize: Featurize a single entry
featurize_many: Featurize a list of entries
featurize_dataframe: Compute features for many entries, store results
as columns in a dataframe

Some featurizers require first calling the fit method before the featurization methods can function. Generally, you pass the dataset to fit to determine which features a featurizer should compute. For example, a featurizer that returns the partial radial distribution function may need to know which elements are present in a dataset.

You can also employ the featurizer as part of a scikit-learn Pipeline object. For these cases, scikit-learn calls the transform function of the BaseFeaturizer, which is a less-featured wrapper of featurize_many. You would then provide your input data as an array to the Pipeline, which would output the features as an array.

Beyond the featurizing capability, BaseFeaturizer also includes methods for retrieving proper references for a featurizer. The citations function returns a list of papers that should be cited. The implementors function returns a list of people who wrote the featurizer, so that you know who to contact with questions.

## Implementing a New BaseFeaturizer Class

These operations must be implemented for each new featurizer:
featurize - Takes a single material as input, returns the features of
that material.
feature_labels - Generates a human-meaningful name for each of the
features.
citations - Returns a list of citations in BibTeX format.
implementors - Returns a list of people who contributed to writing the featurizer.

None of these operations should change the state of the featurizer. I.e., running each method twice should not produce different results, no class attributes should be changed, and running one operation should not affect the output of another.

All options of the featurizer must be set by the __init__ function. All options must be listed as keyword arguments with default values, and the value must be saved as a class attribute with the same name (e.g., argument n should be stored in self.n). These requirements are necessary for compatibility with the get_params and set_params methods of BaseEstimator, which enable easy interoperability with scikit-learn.

Depending on the complexity of your featurizer, it may be worthwhile to implement a from_preset class method. The from_preset method takes the name of a preset and returns an instance of the featurizer with some hard-coded set of inputs. The from_preset option is particularly useful for defining the settings used by papers in the literature.

Optionally, you can implement the fit operation if there are attributes of your featurizer that must be set for the featurizer to work. Any variables that are set by fitting should be stored as class attributes that end with an underscore. (This follows the pattern used by scikit-learn).
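As a rough illustration of the fit convention, the sketch below learns which elements occur in a dataset and stores them in an attribute ending with an underscore. The class name `ElementPresenceSketch` and the use of plain dicts as compositions are assumptions made so the snippet runs without matminer; a real featurizer would subclass BaseFeaturizer.

```python
class ElementPresenceSketch:
    """Hypothetical featurizer: one 0/1 flag per element seen during fit."""

    def fit(self, X, y=None):
        # Attributes set by fitting end with an underscore (scikit-learn pattern).
        self.elements_ = sorted({el for comp in X for el in comp})
        return self  # fit returns self, as documented for BaseFeaturizer.fit

    def featurize(self, comp):
        # Requires fit to have been called first.
        return [1.0 if el in comp else 0.0 for el in self.elements_]

    def feature_labels(self):
        return ["contains " + el for el in self.elements_]


# Compositions are given as plain dicts (element -> amount) in this sketch.
dataset = [{"Fe": 2, "O": 3}, {"Na": 1, "Cl": 1}]
fitted = ElementPresenceSketch().fit(dataset)
flags = fitted.featurize({"Na": 1, "Cl": 1})
```

Because the set of labels depends on the data passed to fit, calling featurize before fit would fail in this sketch, which mirrors the requirement described above.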

Another implementation to consider is whether it is worth making any utility operations for your featurizer. featurize must return a list of features, but this may not be the most natural representation for your features (e.g., a dict could be better). Making a separate function for computing features in this natural representation and having the featurize function call this method and then convert the data into a list is a recommended approach. Users who want to compute the representation in the natural form can use the utility function and users who want the data in a ML-ready format (list) can call featurize. See PartialRadialDistributionFunction for an example of this concept.
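The requirements in this section can be put together in a minimal sketch. Everything here is illustrative: `FractionStatsSketch`, its `round_digits` option, and the dict-based composition input are hypothetical, and the class is written as a plain Python class so that it runs without matminer installed (a real featurizer would subclass BaseFeaturizer).

```python
class FractionStatsSketch:
    """Hypothetical featurizer: min and max atomic fraction of a composition."""

    def __init__(self, round_digits=6):
        # Every option is a keyword argument with a default, stored under
        # the same name so get_params/set_params could find it.
        self.round_digits = round_digits

    def fractions(self, comp):
        """Utility method: natural (dict) representation of the features."""
        total = sum(comp.values())
        return {el: amt / total for el, amt in comp.items()}

    def featurize(self, comp):
        """ML-ready (list) representation, built from the utility method."""
        fracs = list(self.fractions(comp).values())
        return [round(min(fracs), self.round_digits),
                round(max(fracs), self.round_digits)]

    def feature_labels(self):
        return ["min fraction", "max fraction"]

    def citations(self):
        return []  # nothing to cite for this toy example

    def implementors(self):
        return ["Example Author"]  # placeholder name


featurizer = FractionStatsSketch()
features = featurizer.featurize({"Fe": 2, "O": 3})  # composition as a plain dict
```

None of the methods modifies the instance, so calling featurize twice on the same input gives the same list, and the list length always matches feature_labels.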

## Documenting a BaseFeaturizer

The class documentation for each featurizer must contain a description of the options and the features that will be computed. The options of the class must all be defined in the __init__ function of the class, and we recommend documenting them using the [Google style](https://google.github.io/styleguide/pyguide.html).

We recommend starting the class documentation with a high-level overview of the features. For example, mention what kind of characteristics of the material they describe and refer the reader to a paper that describes these features well (use a hyperlink if possible, so that the readthedocs will link to that paper). Then, describe each of the individual features in a block named “Features”. It is necessary here to give the user enough information to map a feature name to what it means. The objective in this part is to allow people to understand what each column of their dataframe is without having to read the Python code. You do not need to explain all of the math/algorithms behind each feature in enough detail to reproduce it, just enough to give an idea of what it is.

citations()

Citation(s) and reference(s) for this feature.

Returns:
(list) each element should be a string citation,
ideally in BibTeX format.
feature_labels()

Generate attribute names.

Returns:
([str]) attribute labels.
featurize(*x)

Main featurizer function, which has to be implemented in any derived featurizer subclass.

Args:
x: input data to featurize (type depends on featurizer).
Returns:
(list) one or more features.
featurize_dataframe(df, col_id, ignore_errors=False, return_errors=False, inplace=True)

Compute features for all entries contained in input dataframe.

Args:

df (Pandas dataframe): Dataframe containing input data.
col_id (str or list of str): column label containing objects to
featurize. Can be multiple labels if the featurize function requires multiple inputs.
ignore_errors (bool): Returns NaN for dataframe rows where
exceptions are thrown if True. If False, exceptions are thrown as normal.
return_errors (bool): Returns the errors encountered for each
row in a separate XFeaturizer errors column if True. Requires ignore_errors to be True.
inplace (bool): Whether to add new columns to input dataframe (df)

Returns:
updated dataframe.
featurize_many(entries, ignore_errors=False, return_errors=False)

Featurize a list of entries. If featurize takes multiple inputs, supply inputs as a list of tuples.

Args:

entries (list): A list of entries to be featurized.
ignore_errors (bool): Returns NaN for entries where exceptions are
thrown if True. If False, exceptions are thrown as normal.
return_errors (bool): If True, returns the feature list as
determined by ignore_errors with traceback strings added as an extra ‘feature’. Entries which featurize without exceptions have this extra feature set to NaN.
Returns:
(list) features for each entry.
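The interplay of ignore_errors and return_errors described above can be sketched in plain Python. This is only a model of the documented behavior, not matminer's implementation; `featurize_many_sketch` and the toy `inverse` featurizer are hypothetical names.

```python
import traceback


def featurize_many_sketch(featurize, labels, entries,
                          ignore_errors=False, return_errors=False):
    """Apply `featurize` to each entry, mimicking the documented error options."""
    rows = []
    for entry in entries:
        try:
            row = list(featurize(entry))
            if return_errors:
                row.append(float("nan"))  # extra 'feature' is NaN when no error occurred
        except Exception:
            if not ignore_errors:
                raise  # exceptions are thrown as normal
            row = [float("nan")] * len(labels)  # NaN for every feature of this entry
            if return_errors:
                row.append(traceback.format_exc())  # traceback string as extra 'feature'
        rows.append(row)
    return rows


def inverse(x):
    """Hypothetical single-feature featurizer that fails for x == 0."""
    return [1.0 / x]


results = featurize_many_sketch(inverse, ["inverse"], [2.0, 0.0],
                                ignore_errors=True, return_errors=True)
```

With these inputs the first entry yields its feature plus a NaN error slot, while the second yields a NaN feature plus the traceback text of the ZeroDivisionError.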
featurize_wrapper(x)

An exception wrapper for featurize, used in featurize_many and featurize_dataframe. featurize_wrapper changes the behavior of featurize when ignore_errors is True in featurize_many/dataframe.

Args:
x: input data to featurize (type depends on featurizer).
Returns:
(list) one or more features.
fit(X, y=None, **fit_kwargs)

Update the parameters of this featurizer based on available data

Args:
X - [list of tuples], training data
Returns:
self
implementors()

List of implementors of the feature.

Returns:
(list) each element should either be a string with author name (e.g.,
“Anubhav Jain”) or a dictionary with required key “name” and other keys like “email” or “institution” (e.g., {“name”: “Anubhav Jain”, “email”: “ajain@lbl.gov”, “institution”: “LBNL”}).
n_jobs
set_n_jobs(n_jobs)

Set the number of threads for this featurizer.

transform(X)

Compute features for a list of inputs

class matminer.featurizers.base.MultipleFeaturizer(featurizers)

Bases: matminer.featurizers.base.BaseFeaturizer

Class that runs multiple featurizers on the same data. All featurizers must take the same kind of data as input to the featurize function.

__init__(featurizers)

Create a new instance of this featurizer.

Args:
featurizers ([BaseFeaturizer]): list of featurizers to run.
citations()
feature_labels()
featurize(*x)
implementors()

matminer.featurizers.composition module

class matminer.featurizers.composition.AtomicOrbitals

Bases: matminer.featurizers.base.BaseFeaturizer

Determine the highest occupied molecular orbital (HOMO) and lowest unoccupied molecular orbital (LUMO) in a composition. The atomic orbital energies of neutral ions with LDA-DFT were computed by NIST. https://www.nist.gov/pml/data/atomic-reference-data-electronic-structure-calculations

citations()
feature_labels()
featurize(comp)
Args:
comp: (Composition)
pymatgen Composition object
Returns:

HOMO_character: (str) orbital symbol (‘s’, ‘p’, ‘d’, or ‘f’)
HOMO_element: (str) symbol of element for HOMO
HOMO_energy: (float in eV) absolute energy of HOMO
LUMO_character: (str) orbital symbol (‘s’, ‘p’, ‘d’, or ‘f’)
LUMO_element: (str) symbol of element for LUMO
LUMO_energy: (float in eV) absolute energy of LUMO
gap_AO: (float in eV)
the estimated bandgap from HOMO and LUMO energies
implementors()
class matminer.featurizers.composition.BandCenter

Bases: matminer.featurizers.base.BaseFeaturizer

citations()
feature_labels()
featurize(comp)

(Rough) estimation of absolute position of band center using geometric mean of electronegativity.

Args:
comp (Composition).
Returns:
(float) band center.
implementors()
class matminer.featurizers.composition.CationProperty(data_source, features, stats)

Bases: matminer.featurizers.composition.ElementProperty

Features based on the properties of cations in a material

Requires that oxidation states have already been determined

Computes composition-weighted statistics of different elemental properties

citations()
feature_labels()
featurize(comp)
classmethod from_preset(preset_name)
class matminer.featurizers.composition.CohesiveEnergy(mapi_key=None)

Bases: matminer.featurizers.base.BaseFeaturizer

__init__(mapi_key=None)

Get cohesive energy per atom of a compound by combining known elemental cohesive energies with the formation energy of the compound.

Parameters:
mapi_key (str): Materials API key for looking up formation energy
by composition alone (if you don’t set the formation energy yourself).
citations()
feature_labels()
featurize(comp, formation_energy_per_atom=None)
Args:

comp: (str) compound composition, eg: “NaCl”
formation_energy_per_atom: (float) the formation energy per atom of
your compound. If not set, will look up the most stable formation energy from the Materials Project database.
implementors()
class matminer.featurizers.composition.ElectronAffinity

Bases: matminer.featurizers.base.BaseFeaturizer

Calculate average electron affinity times formal charge of anion elements. Note: The formal charges must already be computed before calling featurize. Generates average (electron affinity*formal charge) of anions.

__init__()
citations()
feature_labels()
featurize(comp)
Args:
comp: (Composition) Composition to be featurized
Returns:
avg_anion_affin (single-element list): average electron affinity*formal charge of anions
implementors()
class matminer.featurizers.composition.ElectronegativityDiff(stats=None)

Bases: matminer.featurizers.base.BaseFeaturizer

Features based on the electronegativity difference between the anions and cations in the material.

These features are computed by first determining the concentration-weighted average electronegativity of the anions. For example, the average electronegativity of the anions in CaCoSO is equal to 1/2 of that of S and 1/2 of that of O. We then compute the difference between the electronegativity of each cation and the average anion electronegativity.

The feature values are then determined based on the concentration-weighted statistics in the same manner as ElementProperty features. For example, one value could be the mean electronegativity difference over all the cations.
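The CaCoSO example can be worked through numerically. The sketch below uses Pauling electronegativities and hard-codes which elements are anions and cations; in the featurizer itself that split comes from the oxidation states of the composition. Taking the absolute value of the difference is an assumption of this sketch.

```python
# Pauling electronegativities for the elements in CaCoSO
electronegativity = {"Ca": 1.00, "Co": 1.88, "S": 2.58, "O": 3.44}
amounts = {"Ca": 1, "Co": 1, "S": 1, "O": 1}

# Assumption: anion/cation assignment is given; normally it is derived
# from the oxidation states stored on the composition.
anions = ["S", "O"]
cations = ["Ca", "Co"]

# Concentration-weighted average electronegativity of the anions
anion_total = sum(amounts[el] for el in anions)
avg_anion_en = sum(amounts[el] * electronegativity[el] for el in anions) / anion_total

# Difference between each cation and the average anion electronegativity
en_diff = {el: abs(electronegativity[el] - avg_anion_en) for el in cations}

# One possible statistic: concentration-weighted mean over the cations
cation_total = sum(amounts[el] for el in cations)
mean_en_diff = sum(amounts[el] * en_diff[el] for el in cations) / cation_total
```

The average anion electronegativity is (2.58 + 3.44) / 2 = 3.01, the per-cation differences are about 2.01 for Ca and 1.13 for Co, and their mean is about 1.57.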

Parameters:
data_source (data class): source from which to retrieve element data
stats: Property statistics to compute

Generates average electronegativity difference between cations and anions

__init__(stats=None)
citations()
feature_labels()
featurize(comp)
Args:
comp: Pymatgen Composition object
Returns:
en_diff_stats (list of floats): Property stats of electronegativity difference
implementors()
class matminer.featurizers.composition.ElementFraction

Bases: matminer.featurizers.base.BaseFeaturizer

Class to calculate the atomic fraction of each element in a composition. Generates a vector where each index represents an element in atomic number order.

__init__()
citations()
feature_labels()
featurize(comp)
Args:
comp: Pymatgen Composition object
Returns:
vector (list of floats): fraction of each element in a composition
implementors()
class matminer.featurizers.composition.ElementProperty(data_source, features, stats)

Bases: matminer.featurizers.base.BaseFeaturizer

Class to calculate elemental property attributes. To initialize quickly, use the from_preset() method.

Parameters:
data_source (AbstractData or str): source from which to retrieve
element property data (or use str for preset: “pymatgen”, “magpie”, or “deml”)
features (list of strings): List of elemental properties to use
(these must be supported by data_source)
stats (list of strings): a list of weighted statistics to compute for each
property (see PropertyStats for available stats)
__init__(data_source, features, stats)
citations()
feature_labels()
featurize(comp)

Get elemental property attributes

Args:
comp: Pymatgen composition object
Returns:
all_attributes: Specified property statistics of features
classmethod from_preset(preset_name)

Return ElementProperty from a preset string.

Args:
preset_name: (str) can be one of “magpie”, “deml”, or “matminer”

Returns:
ElementProperty instance configured with the preset

implementors()
class matminer.featurizers.composition.IonProperty(data_source=<matminer.utils.data.PymatgenData object>, fast=False)

Bases: matminer.featurizers.base.BaseFeaturizer

Class to calculate ionic property attributes

__init__(data_source=<matminer.utils.data.PymatgenData object>, fast=False)
Args:
data_source - (OxidationStateMixin) - An AbstractData class that supports
the get_oxidation_state method.
fast - (boolean) whether to assume elements exist in a single oxidation state,
which can dramatically accelerate the calculation of whether an ionic compound is possible, but will miss heterovalent compounds like Fe3O4.
citations()
feature_labels()
featurize(comp)

Ionic character attributes

Args:
comp: (Composition) Composition to be featurized
Returns:
cpd_possible (bool): Indicates if a neutral ionic compound is possible
max_ionic_char (float): Maximum ionic character between two atoms
avg_ionic_char (float): Average ionic character
implementors()
class matminer.featurizers.composition.Miedema(struct_types='inter', ss_types='min', data_source='Miedema')

Bases: matminer.featurizers.base.BaseFeaturizer

Calculate the formation enthalpies of the intermetallic compound, solid solution and amorphous phase of a given composition, based on the semi-empirical Miedema model (and some extensions), particularly for transition metal alloys. Supports elemental, binary and multicomponent alloys.

For elemental/binary alloys, the formulation is based on the original works by Miedema et al. in the 1980s. For multicomponent alloys, the formulation is basically a linear combination of sub-binary systems. This is reported to work well for ternary alloys, but should be used with care for quaternary alloys and beyond.
Args:
struct_types (str or list of str): default=’inter’
if str, one target structure; if list, a list of target structures. e.g.
‘inter’: intermetallic compound
‘ss’: solid solution
‘amor’: amorphous phase
‘all’: same as [‘inter’, ‘ss’, ‘amor’]
[‘inter’, ‘ss’]: intermetallic compound and solid solution, as an example
ss_types (str or list of str): only for ss, default=’min’
if str, one structure type of ss; if list, a list of structure types of ss. e.g.
‘fcc’: fcc solid solution
‘bcc’: bcc solid solution
‘hcp’: hcp solid solution
‘no_latt’: solid solution with no specific structure type
‘min’: min value of [‘fcc’, ‘bcc’, ‘hcp’, ‘no_latt’]
‘all’: same as [‘fcc’, ‘bcc’, ‘hcp’, ‘no_latt’]
[‘fcc’, ‘bcc’]: fcc and bcc solid solutions, as an example
data_source (str): default=’Miedema’, source of dataset
‘Miedema’: read from ‘Miedema.csv’,
parameterized by Miedema et al. in the 1980s, containing parameters for 73 types of elements:
‘molar_volume’, ‘electron_density’, ‘electronegativity’, ‘valence_electrons’, ‘a_const’, ‘R_const’, ‘H_trans’, ‘compressibility’, ‘shear_modulus’, ‘melting_point’, ‘structural_stability’
Returns:
(list of floats) Miedema formation enthalpies (per atom)

-formation_enthalpy_inter: for intermetallic compound
-formation_enthalpy_ss: for solid solution, can be divided into
‘min’, ‘fcc’, ‘bcc’, ‘hcp’, ‘no_latt’
for different lattice_types
-formation_enthalpy_amor: for amorphous phase

__init__(struct_types='inter', ss_types='min', data_source='Miedema')
citations()
data_dir = '/Users/ajain/Documents/code_matgen/matminer/matminer/featurizers/../utils/data_files'
deltaH_chem(elements, fracs, struct)

Chemical term of formation enthalpy

Args:
elements (list of str): list of elements
fracs (list of floats): list of atomic fractions
struct (str): ‘inter’, ‘ss’ or ‘amor’
Returns:
deltaH_chem (float): chemical term of formation enthalpy
deltaH_elast(elements, fracs)

Elastic term of formation enthalpy

Args:
elements (list of str): list of elements
fracs (list of floats): list of atomic fractions
Returns:
deltaH_elastic (float): elastic term of formation enthalpy
deltaH_struct(elements, fracs, latt)

Structural term of formation enthalpy, only for solid solution

Args:
elements (list of str): list of elements
fracs (list of floats): list of atomic fractions
latt (str): ‘fcc’, ‘bcc’, ‘hcp’ or ‘no_latt’
Returns:
deltaH_struct (float): structural term of formation enthalpy
deltaH_topo(elements, fracs)

Topological term of formation enthalpy, only for amorphous phase

Args:
elements (list of str): list of elements
fracs (list of floats): list of atomic fractions
Returns:
deltaH_topo (float): topological term of formation enthalpy
feature_labels()
featurize(comp)

Get Miedema formation enthalpies of target structures: inter, amor, ss (can be further divided into ‘min’, ‘fcc’, ‘bcc’, ‘hcp’, ‘no_latt’ for different lattice_types)
Args:
comp: Pymatgen composition object
Returns:
miedema (list of floats): formation enthalpies of target structures
implementors()
class matminer.featurizers.composition.OxidationStates(stats=None)

Bases: matminer.featurizers.base.BaseFeaturizer

Statistics about the oxidation states for each specie. Features are concentration-weighted statistics of the oxidation states.

__init__(stats=None)
Args:
stats - (list of string), which statistics to compute
citations()
feature_labels()
featurize(comp)
classmethod from_preset(preset_name)
implementors()
class matminer.featurizers.composition.Stoichiometry(p_list=(0, 2, 3, 5, 7, 10), num_atoms=False)

Bases: matminer.featurizers.base.BaseFeaturizer

Calculate stoichiometric attributes.

Parameters:
p_list (list of ints): list of norms to calculate
num_atoms (bool): whether to return number of atoms per formula unit
__init__(p_list=(0, 2, 3, 5, 7, 10), num_atoms=False)
citations()
feature_labels()
featurize(comp)

Get stoichiometric attributes

Args:
comp: Pymatgen composition object
p_list (list of ints)
Returns:
p_norm (list of floats): Lp norm-based stoichiometric attributes.
Returns number of atoms if no p-values specified.
implementors()
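A short sketch of Lp norm-based stoichiometric attributes may help in reading these features. It assumes the norm is taken over the atomic fractions and that p = 0 simply counts the elements present; both are assumptions of this illustration (the function name `stoichiometric_norms` is hypothetical), not a statement of how the class is implemented.

```python
def stoichiometric_norms(amounts, p_list=(0, 2, 3, 5, 7, 10)):
    """Lp norms of the atomic fractions of a composition given as a dict."""
    total = sum(amounts.values())
    fractions = [amt / total for amt in amounts.values()]
    norms = []
    for p in p_list:
        if p == 0:
            # Assumption: the "L0 norm" is the number of elements present.
            norms.append(float(len(fractions)))
        else:
            norms.append(sum(f ** p for f in fractions) ** (1.0 / p))
    return norms


# Fe2O3 has atomic fractions 0.4 (Fe) and 0.6 (O)
norms = stoichiometric_norms({"Fe": 2, "O": 3}, p_list=(0, 2))
```

For Fe2O3 the p = 2 value is the square root of 0.4² + 0.6² = 0.52; larger p values weight the majority element more strongly.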
class matminer.featurizers.composition.TMetalFraction

Bases: matminer.featurizers.base.BaseFeaturizer

Class to calculate fraction of magnetic transition metals in a composition.

Parameters:
data_source (data class): source from which to retrieve element data

Generates: Fraction of magnetic transition metal atoms in a compound

__init__()
citations()
feature_labels()
featurize(comp)
Args:
comp: Pymatgen Composition object
Returns:
frac_magn_atoms (single-element list): fraction of magnetic transition metal atoms in a compound
implementors()
class matminer.featurizers.composition.ValenceOrbital(orbitals=('s', 'p', 'd', 'f'), props=('avg', 'frac'))

Bases: matminer.featurizers.base.BaseFeaturizer

Class to calculate valence orbital attributes

Parameters:

data_source (data object): source from which to retrieve element data
orbitals (list): orbitals to calculate
props (list): specifies whether to return average number of electrons in each orbital,
fraction of electrons in each orbital, or both
__init__(orbitals=('s', 'p', 'd', 'f'), props=('avg', 'frac'))
citations()
feature_labels()
featurize(comp)

Weighted fraction of valence electrons in each orbital

Args:
comp: Pymatgen composition object
Returns:
valence_attributes (list of floats): Average number and/or
fraction of valence electrons in specified orbitals
implementors()
matminer.featurizers.composition.has_oxidation_states(comp)

Check if a composition object has oxidation states for each element

TODO: Does this make sense to add to pymatgen? -wardlt

Args:
comp - (Composition) Composition to check
Returns:
(Boolean) Whether this composition object contains oxidation states

matminer.featurizers.dos module

class matminer.featurizers.dos.DOSFeaturizer(contributors=1, significance_threshold=0.1, energy_cutoff=0.5, sampling_resolution=100, gaussian_smear=0.1)

Bases: matminer.featurizers.base.BaseFeaturizer

Featurizes a pymatgen density of states, CompleteDos, object.

__init__(contributors=1, significance_threshold=0.1, energy_cutoff=0.5, sampling_resolution=100, gaussian_smear=0.1)
Args:
contributors (int):
Sets the number of top contributors to the DOS that are returned as features. (i.e. contributors=1 will only return the main cb and main vb orbital)
significance_threshold (float):
Sets the significance threshold for orbitals in the DOS. Does not impact the number of contributors returned. Only determines the feature value xbm_significant_contributors. The threshold is a fractional value between 0 and 1.
energy_cutoff (float in eV):
The extent (into the bands) to sample the DOS
sampling_resolution (int):
Number of points to sample DOS
gaussian_smear (float in eV):
Gaussian smearing (sigma) around each sampled point in the DOS
feature_labels()
featurize(dos)
Args:
dos (pymatgen CompleteDos or their dict):
The density of states to featurize. Must be a complete DOS, (i.e. contains PDOS and structure, in addition to total DOS) and must contain the structure.
Returns:

xbm_score_i (float): fractions of ith contributor orbital
xbm_location_i (str): fractional coordinate of ith contributor.
For example, ‘0.0;0.0;0.0’ if Gamma
xbm_specie_i: (str) elemental specie of ith contributor (ex: ‘Ti’)
xbm_character_i: (str) orbital character of ith contributor (s p d or f)
xbm_nsignificant: (int) the number of orbitals with contributions
above the significance_threshold
implementors()
matminer.featurizers.dos.get_cbm_vbm_scores(dos, energy_cutoff, sampling_resolution, gaussian_smear)
Quantifies the strength of the contribution of all orbitals of various
species/sites to the conduction band minimum (CBM) and the valence band maximum (VBM) up to energy_cutoff inside the bands from the CBM/VBM. An example use of the output may be sorting it based on cbm_score or vbm_score.
Args:
dos (pymatgen CompleteDos or their dict):
The density of states to featurize. Must be a complete DOS, (i.e. contains PDOS and structure, in addition to total DOS)
energy_cutoff (float in eV):
The extent (into the bands) to sample the DOS
sampling_resolution (int):
Number of points to sample DOS
gaussian_smear (float in eV):
Gaussian smearing (sigma) around each sampled point in the DOS
Returns:
orbital_scores [(dict)]:
A list of how much each orbital contributes to the partial density of states up to energy_cutoff. Dictionary items are:
cbm_score: (float) fractional contribution to conduction band
vbm_score: (float) fractional contribution to valence band
species: (pymatgen Specie) the Specie of the orbital
character: (str) is the orbital character s, p, d, or f
location: [(float)] fractional coordinates of the orbital

matminer.featurizers.function module

class matminer.featurizers.function.FunctionFeaturizer(expressions=None, multi_feature_depth=1, postprocess=None, combo_function=None, latexify_labels=False)

Bases: matminer.featurizers.base.BaseFeaturizer

This class featurizes a dataframe according to a set of expressions representing functions to apply to existing features. The approach here uses sympy-based parsing of string expressions, rather than explicit Python functions. The primary reason for this is to provide better support for book-keeping (e.g. with feature labels), substitution, and elimination of symbolic redundancy, for which sympy is well-suited.

__init__(expressions=None, multi_feature_depth=1, postprocess=None, combo_function=None, latexify_labels=False)
Args:
expressions ([str]): list of sympy-parseable expressions
representing a function of a single variable x, e. g. [“1 / x”, “x ** 2”], defaults to the list above
multi_feature_depth (int): how many features to include if using
multiple fields for functionalization, e. g. 2 will include pairwise combined features
postprocess (function or type): type to cast functional outputs
to, if, for example, you want to include the possibility of complex numbers in your outputs, use postprocess=np.complex, defaults to float
combo_function (function): function to combine multi-features,
defaults to np.prod (i.e. cumulative product of expressions), note that a combo function must cleanly process sympy expressions
latexify_labels (bool): whether to render labels in latex,
defaults to False
citations()
exp_dict

Generates a dictionary of expressions keyed by number of variables in each expression

Returns:
Dictionary of expressions keyed by number of variables
feature_labels()
Returns:
Set of feature labels corresponding to expressions
featurize(*args)

Main featurizer function, essentially iterates over all of the functions in self.function_list to generate features for each argument.

Args:
*args: list of numbers to generate functional output
features
Returns:
list of functional outputs corresponding to input args
featurize_dataframe(df, col_id, ignore_errors=False, return_errors=False, inplace=True)

Compute features for all entries contained in input dataframe.

Args:

df (DataFrame): dataframe containing input data
col_id (str or list of str): column label containing objects
to featurize, can be single or multiple column names
ignore_errors (bool): Returns NaN for dataframe rows where
exceptions are thrown if True. If False, exceptions are thrown as normal.
return_errors (bool): Returns the errors encountered for each
row in a separate XFeaturizer errors column if True. Requires ignore_errors to be True.
inplace (bool): Whether to add new columns to input dataframe (df)

Returns:
updated DataFrame
generate_string_expressions(input_variable_names)

Method to generate string expressions for input strings, mainly used to generate columns names for featurize_dataframe

Args:
input_variable_names ([str]): strings corresponding to
functional input variable names
Returns:
List of string expressions generated by substitution of variable names into functions
implementors()
matminer.featurizers.function.generate_expressions_combinations(expressions, combo_depth=2, combo_function=<function prod>)

This function takes a list of strings representing functions of x, converts them to sympy expressions, and combines them according to the combo_depth parameter. Also filters resultant expressions for any redundant ones determined by sympy expression equivalence.

Args:
expressions (strings): all of the sympy-parseable strings
to be converted to expressions and combined, e. g. [“1 / x”, “x ** 2”], must be functions of x

combo_depth (int): the number of independent variables to consider
combo_function (method): the function which combines the
respective expressions provided, defaults to np.prod, i.e. the cumulative product of the expressions
Returns:
list of unique non-trivial expressions for featurization
of inputs

matminer.featurizers.site module

class matminer.featurizers.site.AGNIFingerprints(directions=(None, 'x', 'y', 'z'), etas=None, cutoff=8)

Bases: matminer.featurizers.base.BaseFeaturizer

Integral of the product of the radial distribution function and a
Gaussian window function. Originally used by [Botu et al] (http://pubs.acs.org/doi/abs/10.1021/acs.jpcc.6b10908) to fit empirical potentials. These features come in two forms: atomic fingerprints and direction-resolved fingerprints. Atomic fingerprints describe the local environment of an atom and are computed using the function:

:math:`A_i(\eta) = \sum\limits_{i \ne j} e^{-(\frac{r_{ij}}{\eta})^2} f(r_{ij})`

where i is the index of the atom, j is the index of a neighboring atom, :math:`\eta` is a scaling function, :math:`r_{ij}` is the distance between atoms i and j, and f(r) is a cutoff function where :math:`f(r) = 0.5[\cos(\frac{\pi r_{ij}}{R_c}) + 1]` if :math:`r < R_c` and 0 otherwise.

The direction-resolved fingerprints are computed using:

:math:`V_i^k(\eta) = \sum\limits_{i \ne j} \frac{r_{ij}^k}{r_{ij}} e^{-(\frac{r_{ij}}{\eta})^2} f(r_{ij})`

where :math:`r_{ij}^k` is the :math:`k^{th}` component of :math:`\mathbf{r}_i - \mathbf{r}_j`.

Parameters: TODO: Differentiate between different atom types (maybe as another class)
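The two fingerprint definitions can be sketched in plain Python. This is only an illustration of the formulas, not matminer's implementation: the helper names `cosine_cutoff` and `agni_fingerprint`, and the input format (a list of displacement vectors r_i - r_j, one per neighbor), are assumptions made for this example.

```python
import math


def cosine_cutoff(r, r_c):
    """f(r) = 0.5 * [cos(pi * r / R_c) + 1] for r < R_c, and 0 otherwise."""
    if r >= r_c:
        return 0.0
    return 0.5 * (math.cos(math.pi * r / r_c) + 1.0)


def agni_fingerprint(displacements, eta, r_c, direction=None):
    """Sum the Gaussian-windowed, cutoff-damped contributions of all neighbors.

    displacements: list of (dx, dy, dz) vectors r_i - r_j (assumed input format).
    direction: None for the atomic fingerprint A_i, or 'x', 'y', 'z' for V_i^k.
    """
    axis = {"x": 0, "y": 1, "z": 2}
    total = 0.0
    for vec in displacements:
        r = math.sqrt(sum(c * c for c in vec))
        if r == 0.0:
            continue  # skip the central atom itself (the sum runs over i != j)
        window = math.exp(-((r / eta) ** 2)) * cosine_cutoff(r, r_c)
        if direction is None:
            total += window
        else:
            total += vec[axis[direction]] / r * window
    return total
```

For two neighbors placed symmetrically at +x and -x, the atomic fingerprint adds up while the x-resolved fingerprint cancels, which is the behavior the direction-resolved form is meant to capture.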
__init__(directions=(None, 'x', 'y', 'z'), etas=None, cutoff=8)
Args:
directions (iterable): List of directions for the fingerprints. Can
be one or more of None, ‘x’, ‘y’, or ‘z’
etas (iterable of floats): List of which window widths to compute
cutoff (float): Cutoff distance (Angstroms)

citations()
feature_labels()
featurize(struct, idx)
implementors()
class matminer.featurizers.site.AngularFourierSeries(bins, cutoff=10.0)

Bases: matminer.featurizers.base.BaseFeaturizer

Compute the angular Fourier series (AFS) for a site. The AFS includes both radial and angular information about site neighbors. The AFS is the product of distance functionals (g_n, g_n’) between two pairs of atoms (sharing the common central site) and the cosine of the angle between the two pairs. The AFS is a 2-dimensional feature (the axes are g_n, g_n’).

Examples of distance functionals are square functions, Gaussian, trig functions, and Bessel functions. An example for Gaussian:

lambda d: exp( -(d - d_n)**2 ), where d_n is the coefficient for g_n
There are two preset conditions:
gaussian: bin functionals are gaussians
histogram: bin functionals are rectangular functions
Args:
bins: (list of tuples) a list of (str, functions). The str is a text
label for each bin functional. The functions should accept scalar numpy arrays (each scalar value corresponds to a distance) and return arrays of floats
(e.g. lambda d: exp( - a_0 * (d - b_0)**2 )).
cutoff: (float) maximum distance to look for neighbors. The
featurizer will run slowly for large distance cutoffs because the number of neighbor pairs scales as the square of the number of neighbors.
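One possible reading of the AFS description is sketched below in plain Python. It is not matminer's code: the function names are hypothetical, neighbors are given as displacement vectors from the central site, and counting each ordered pair of distinct neighbors once is an assumption.

```python
import math
from itertools import permutations


def gaussian_bin(center):
    """Distance functional exp(-(d - d_n)**2) with d_n = center."""
    return lambda d: math.exp(-((d - center) ** 2))


def angular_fourier_series(neighbor_vectors, bins):
    """Return a flattened list ordered as (g_n, g_n'), one value per bin pair.

    neighbor_vectors: displacement vectors from the central site (assumed input).
    bins: list of (label, function) tuples, as described in Args.
    """
    features = []
    for _, g_n in bins:
        for _, g_m in bins:
            total = 0.0
            # Every ordered pair of distinct neighbors sharing the central site.
            for v1, v2 in permutations(neighbor_vectors, 2):
                d1 = math.sqrt(sum(c * c for c in v1))
                d2 = math.sqrt(sum(c * c for c in v2))
                cos_theta = sum(a * b for a, b in zip(v1, v2)) / (d1 * d2)
                total += g_n(d1) * g_m(d2) * cos_theta
            features.append(total)
    return features
```

The double loop over neighbors makes the quadratic cost mentioned for the cutoff argument explicit: doubling the number of neighbors roughly quadruples the number of pairs.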
__init__(bins, cutoff=10.0)
citations()
feature_labels()
featurize(struct, idx)

Get AFS of the input structure.

Args:
struct (Structure): Pymatgen Structure object.
idx (int): index of target site in structure struct.
Returns:
Flattened list of AFS values. the list order is:
g_n g_n’
static from_preset(preset, width=0.5, spacing=0.5, cutoff=10)
Preset bin functionals for this featurizer. Example use:
>>> AFS = AngularFourierSeries.from_preset('gaussian')
>>> AFS.featurize(struct, idx)
Args:
preset (str): shape of bin (either ‘gaussian’ or ‘histogram’)
width (float): bin width. std dev for gaussian, width for histogram
spacing (float): the spacing between bin centers
cutoff (float): maximum distance to look for neighbors
implementors()
class matminer.featurizers.site.ChemEnvSiteFingerprint(cetypes, strategy, geom_finder, max_csm=8, max_dist_fac=1.41)

Bases: matminer.featurizers.base.BaseFeaturizer

Site fingerprint computed from pymatgen’s ChemEnv package that provides resemblance percentages of a given site to ideal environments.

Args:
cetypes ([str]): chemical environments (CEs) to be
considered.
strategy (ChemenvStrategy): ChemEnv neighbor-finding strategy.
geom_finder (LocalGeometryFinder): ChemEnv local geometry finder.
max_csm (float): maximum continuous symmetry measure (CSM;
default of 8 taken from chemenv). Note that any CSM larger than max_csm will be set to max_csm in order to avoid negative values (i.e., all features are constrained to be between 0 and 1).
max_dist_fac (float): maximum distance factor (default: 1.41).

__init__(cetypes, strategy, geom_finder, max_csm=8, max_dist_fac=1.41)
citations()
feature_labels()
featurize(struct, idx)

Get ChemEnv fingerprint of site with given index in input structure.

Args:
struct (Structure): Pymatgen Structure object.
idx (int): index of target site in structure struct.
Returns:
(numpy array): resemblance fraction of target site to ideal
local environments.
static from_preset(preset)

Use a standard collection of CE types and choose your ChemEnv neighbor-finding strategy. Args:

preset (str): preset types (“simple” or
“multi_weights”).
Returns:
ChemEnvSiteFingerprint object from a preset.
implementors()
class matminer.featurizers.site.ChemicalSRO(nn, includes=None, excludes=None, sort=True)

Bases: matminer.featurizers.base.BaseFeaturizer

Chemical short-range ordering (SRO) features to evaluate the deviation of local chemistry from the nominal composition of the structure. f_el = N_el/(sum of N_el) - c_el, where N_el is the number of neighbors of each element type around the target site, sum of N_el is the sum over all element types (the coordination number), and c_el is the fraction of the specific element in the entire structure. A positive f_el indicates that “bonding” with the specific element is favored, at least at the target site; a negative f_el indicates that the “bonding” is not favored, at least at the target site.

Note that ChemicalSRO is only featurized for elements identified by “fit” (see below), so “fit” must be called before “featurize”, or else an error will be raised.

Args:
nn (NearestNeighbor): instance of one of pymatgen’s Nearest Neighbor
classes.
includes (array-like or str): elements included to calculate CSRO.
excludes (array-like or str): elements excluded to calculate CSRO.
sort (bool): whether to sort elements by Mendeleev number.
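The f_el formula can be worked through with a small stand-alone example. The function `chemical_sro` and its inputs (a list of neighbor element symbols and a dict of overall atomic fractions) are hypothetical and chosen only to illustrate the arithmetic; the featurizer itself obtains neighbors from the nn object.

```python
from collections import Counter


def chemical_sro(neighbor_elements, composition):
    """f_el = N_el / (sum of N_el) - c_el for every element in the composition.

    neighbor_elements: element symbols of the neighbors of the target site.
    composition: dict mapping element symbol to its atomic fraction c_el.
    """
    counts = Counter(neighbor_elements)
    coordination_number = len(neighbor_elements)
    return {
        el: counts.get(el, 0) / coordination_number - c_el
        for el, c_el in composition.items()
    }


# A site with three Al neighbors and one Ni neighbor in an equiatomic alloy:
sro = chemical_sro(["Al", "Al", "Al", "Ni"], {"Al": 0.5, "Ni": 0.5})
# Al: 3/4 - 0.5 = +0.25 (favored); Ni: 1/4 - 0.5 = -0.25 (not favored)
```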

__init__(nn, includes=None, excludes=None, sort=True)
citations()
feature_labels()
featurize(struct, idx)

Get CSRO features of site with given index in input structure.

Args:
struct (Structure): Pymatgen Structure object.
idx (int): index of target site in structure.
Returns:
(list of floats): Chemical SRO features for each element.
fit(X, y=None)

Identify elements to be included in the following featurization, by intersecting the elements present in the passed structures with those explicitly included (or excluded) in __init__. Only elements in self.el_list_ will be featurized. In addition, compositions of the passed structures are stored in the dict self.el_amt_dict_, avoiding repeated calculation of composition when featurizing multiple sites in the same structure.

Args:
X (array-like): containing Pymatgen structures and sites, supports
multiple choices:
- 2D array-like object:
e.g. [[struct, site], [struct, site], …]
np.array([[struct, site], [struct, site], …])
- Pandas dataframe:
e.g. df[[‘struct’, ‘site’]]

y : unused (added for consistency with overridden method signature)

Returns:
self
static from_preset(preset, **kwargs)

Use one of the standard instances of a given NearNeighbor class.

Args:
preset (str): preset type (“VoronoiNN”, “JMolNN”,
“MinimumDistanceNN”, “MinimumOKeeffeNN”, or “MinimumVIRENN”).
**kwargs: allows passing args to the NearNeighbor class.

Returns:
ChemicalSRO from a preset.
implementors()
class matminer.featurizers.site.CoordinationNumber(nn, use_weights=False)

Bases: matminer.featurizers.base.BaseFeaturizer

Coordination number (CN) computed using one of pymatgen’s NearNeighbor classes for determination of near neighbors contributing to the CN. Args:

nn (NearNeighbor): instance of one of pymatgen’s NearNeighbor
classes.
__init__(nn, use_weights=False)
citations()
feature_labels()
featurize(struct, idx)

Get coordination number of site with given index in input structure.

Args:
struct (Structure): Pymatgen Structure object.
idx (int): index of target site in structure struct.
Returns:
(float): coordination number.
static from_preset(preset, **kwargs)

Use one of the standard instances of a given NearNeighbor class.

Args:
preset (str): preset type (“VoronoiNN”, “JMolNN”,
“MinimumDistanceNN”, “MinimumOKeeffeNN”, or “MinimumVIRENN”).
**kwargs: allows passing args to the NearNeighbor class.

Returns:
CoordinationNumber from a preset.
implementors()
class matminer.featurizers.site.CrystalSiteFingerprint(optypes, override_cn1=True, cutoff_radius=8, tol=0.01, cation_anion=False)

Bases: matminer.featurizers.base.BaseFeaturizer

A site fingerprint intended for periodic crystals. The fingerprint represents the value of various order parameters for the site; each value is the product of two quantities: (i) the value of the order parameter itself and (ii) a factor that describes how consistent the number of neighbors is with that order parameter. Note that we can include only factor (ii) using the “wt” order parameter, which is always set to 1.

__init__(optypes, override_cn1=True, cutoff_radius=8, tol=0.01, cation_anion=False)

Initialize the CrystalSiteFingerprint. Use the from_preset() function to use default params. Args:

optypes (dict): a dict of coordination number (int) to a list of str
representing the order parameter types
override_cn1 (bool): whether to use a special function for the single
neighbor case. Suggest to keep True.

cutoff_radius (int): radius in Angstroms for neighbor finding
tol (float): numerical tolerance (in case your site distances are
not perfect or to correct for float tolerances)
cation_anion (bool): whether to only consider cation<->anion bonds
(bonds with zero charge are also allowed)
citations()
feature_labels()
featurize(struct, idx)

Get crystal fingerprint of site with given index in input structure.

Args:
struct (Structure): Pymatgen Structure object.
idx (int): index of target site in structure.
Returns:
list of weighted order parameters of target site.
static from_preset(preset, cation_anion=False)

Use preset parameters to get the fingerprint.

Args:
preset (str): name of preset (“cn” or “ops”)
cation_anion (bool): whether to only consider cation<->anion bonds
(bonds with zero charge are also allowed)
implementors()
class matminer.featurizers.site.EwaldSiteEnergy(accuracy=None)

Bases: matminer.featurizers.base.BaseFeaturizer

Compute site energy from Coulombic interactions.

User notes:

  • This class uses the charges that are already defined for the structure.
  • Ewald summations can be expensive. If you are evaluating every site in many large structures, run all of the sites for each structure at the same time. We cache the Ewald result for the structure that was run last, so evaluating all sites of one structure before moving on to the next structure is faster than alternating between structures.
Features:
ewald_site_energy - Energy for the site computed from Coulombic interactions
__init__(accuracy=None)
Args:
accuracy (int): Accuracy of Ewald summation, number of decimal places
citations()
feature_labels()
featurize(strc, idx)
Args:
struct (Structure): Pymatgen Structure object.
idx (int): index of target site in structure.
Returns:
([float]) - Electrostatic energy of the site
implementors()
class matminer.featurizers.site.GaussianSymmFunc(etas_g2=None, etas_g4=None, zetas_g4=None, gammas_g4=None, cutoff=6.5)

Bases: matminer.featurizers.base.BaseFeaturizer

Gaussian symmetry function features suggested by Behler et al., based on pair distances and angles, to approximate the functional dependence of local energies, originally used in the fitting of machine-learning potentials. The symmetry functions can be divided into a set of radial functions (g2 function) and a set of angular functions (g4 function). The number of symmetry functions returned depends on the parameters etas_g2, etas_g4, zetas_g4 and gammas_g4. See the original paper for more details: “Atom-centered symmetry functions for constructing high-dimensional neural network potentials”, J Behler, J Chem Phys 134, 074106 (2011). The cutoff function is taken as the polynomial form (cosine_cutoff) to give a smoothed truncation. A Fortran and a different Python version can be found in the code Amp: Atomistic Machine-learning Package (https://bitbucket.org/andrewpeterson/amp).

Args:

etas_g2 (list of floats): etas used in radial functions.
(default: [0.05, 4., 20., 80.])
etas_g4 (list of floats): etas used in angular functions.
(default: [0.005])
zetas_g4 (list of floats): zetas used in angular functions.
(default: [1., 4.])
gammas_g4 (list of floats): gammas used in angular functions.
(default: [+1., -1.])

cutoff (float): cutoff distance. (default: 6.5)
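As a rough illustration of a radial symmetry function, the sketch below follows the G2 form given in the cited Behler paper, with the Gaussian centered at zero distance and a cosine-shaped cutoff. It is an assumption-laden example, not the library's g2: the names `cosine_cutoff_sketch` and `g2_sketch` are hypothetical, and any rescaling of eta by the cutoff that an implementation may apply is not shown.

```python
import math


def cosine_cutoff_sketch(r, cutoff):
    """0.5 * [cos(pi * r / cutoff) + 1] inside the cutoff, 0 outside."""
    if r >= cutoff:
        return 0.0
    return 0.5 * (math.cos(math.pi * r / cutoff) + 1.0)


def g2_sketch(eta, center_coord, neigh_coords, cutoff):
    """Radial function: sum over neighbors of exp(-eta * r**2) * f_c(r)."""
    total = 0.0
    for coord in neigh_coords:
        r = math.sqrt(sum((a - b) ** 2 for a, b in zip(coord, center_coord)))
        total += math.exp(-eta * r ** 2) * cosine_cutoff_sketch(r, cutoff)
    return total
```

Each eta in etas_g2 would yield one such radial feature, so a list of four etas gives four radial features for the site.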

__init__(etas_g2=None, etas_g4=None, zetas_g4=None, gammas_g4=None, cutoff=6.5)
citations()
static cosine_cutoff(r, cutoff)

Polynomial cutoff function to give a smoothed truncation of the Gaussian symmetry functions.

Args:
r (float): distance.
cutoff (float): cutoff distance.
Returns:
(float) cutoff function.
feature_labels()
featurize(struct, idx)

Get Gaussian symmetry function features of site with given index in input structure.

Args:
struct (Structure): Pymatgen Structure object.
idx (int): index of target site in structure.
Returns:
(list of floats): Gaussian symmetry function features.
static g2(eta, center_coord, neigh_coords, cutoff)

Gaussian radial symmetry function of the center atom, given an eta parameter.

Args:
eta: radial function parameter.
center_coord (list of floats): coordinates of center atom.
neigh_coords (list of [floats]): coordinates of neighboring atoms.
cutoff (float): cutoff distance.
Returns:
(float) Gaussian radial symmetry function.
static g4(eta, zeta, gamma, center_coord, neigh_coords, cutoff)

Gaussian angular symmetry function of the center atom, given a set of eta, zeta and gamma parameters.

Args:
eta (float): angular function parameter.
zeta (float): angular function parameter.
gamma (float): angular function parameter.
center_coord (list of floats): coordinates of center atom.
neigh_coords (list of [floats]): coordinates of neighboring atoms.
cutoff (float): cutoff parameter.
Returns:
(float) Gaussian angular symmetry function.
implementors()
class matminer.featurizers.site.GeneralizedRadialDistributionFunction(bins, cutoff=20.0, mode='GRDF')

Bases: matminer.featurizers.base.BaseFeaturizer

Compute the general radial distribution function (GRDF) for a site. The GRDF is a radial measure of crystal order around a site. There are two featurizing modes:

  1. GRDF: (recommended) - n_bins length vector
    In GRDF mode, the GRDF is computed by considering all sites around a central site (i.e., no sites are omitted when computing the GRDF). The features output from this mode will be vectors with length n_bins.
  2. pairwise GRDF: (advanced users) - n_sites x n_bins matrix
    In this mode, GRDFs are still computed around a central site, but only one other site (and its translational equivalents) is used to compute a GRDF (e.g. site 1 with site 2 and the translational equivalents of site 2). This results in an n_sites x n_bins matrix of features. Requires fit for determining the max number of sites for

The GRDF is a generalization of the partial radial distribution function (PRDF). In contrast with the PRDF, the bins of the GRDF are not mutually exclusive and need not carry a constant weight of 1. The PRDF is a special case of the GRDF in which the bins are rectangular functions. Examples of other functions to use with the GRDF are Gaussian, trig, and Bessel functions.

There are two preset conditions:
gaussian: bin functionals are gaussians
histogram: bin functionals are rectangular functions
Args:
bins: (list of tuples) a list of (str, functions). The str is a text
label for each bin functional. The functions should accept scalar numpy arrays (each scalar value corresponds to a distance) and return arrays of floats. (e.g. lambda d: exp( - a_0 * (d - b_0)**2 ))
cutoff: (float) maximum distance to look for neighbors
mode: (str) the featurizing mode. Supported options are
‘GRDF’ and ‘pairwise_GRDF’
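The relationship between bin functionals and the resulting feature vector can be sketched as follows. This is an unnormalized toy version written for illustration: the helper names are hypothetical, the input is a plain list of neighbor distances, and any volume or density normalization a real implementation may apply is deliberately left out.

```python
import math


def rectangular_bin(lo, hi):
    """Histogram-style functional: weight 1 inside [lo, hi), 0 elsewhere."""
    return lambda d: 1.0 if lo <= d < hi else 0.0


def gaussian_bin(a_0, b_0):
    """Gaussian functional exp(-a_0 * (d - b_0)**2), as in the Args example."""
    return lambda d: math.exp(-a_0 * (d - b_0) ** 2)


def grdf_sketch(distances, bins):
    """One feature per (label, function) bin: the summed weight of all neighbors."""
    return [sum(func(d) for d in distances) for _, func in bins]


distances = [0.5, 1.2, 1.7]  # neighbor distances around one site (made-up values)
hist_bins = [("0-1", rectangular_bin(0.0, 1.0)), ("1-2", rectangular_bin(1.0, 2.0))]
hist_features = grdf_sketch(distances, hist_bins)  # plain neighbor counts per shell
```

With rectangular bins each neighbor contributes to exactly one bin, which is the PRDF-like special case; with Gaussian bins a neighbor can contribute fractional weight to several overlapping bins.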
__init__(bins, cutoff=20.0, mode='GRDF')
citations()
feature_labels()
featurize(struct, idx)

Get GRDF of the input structure.

Args:
struct (Structure): Pymatgen Structure object.
idx (int): index of target site in structure struct.
Returns:
Flattened list of GRDF values. For each run mode the list order is:
GRDF: bin#
pairwise GRDF: site2# bin#

The site2# corresponds to a pymatgen site index and bin# corresponds to one of the bin functionals

fit(X, y=None, **fit_kwargs)

Determine the maximum number of sites in X to assign correct feature labels

Args:
X - [list of tuples], training data
tuple values should be (struc, idx)
Returns:
self
static from_preset(preset, width=0.5, spacing=0.5, cutoff=10, mode='GRDF')
Preset bin functionals for this featurizer. Example use:
>>> GRDF = GeneralizedRadialDistributionFunction.from_preset('gaussian')
>>> GRDF.featurize(struct, idx)
Args:
preset (str): shape of bin (either ‘gaussian’ or ‘histogram’)
width (float): bin width. std dev for gaussian, width for histogram
spacing (float): the spacing between bin centers
cutoff (float): maximum distance to look for neighbors
mode (str): featurizing mode. either ‘GRDF’ or ‘pairwise_GRDF’
implementors()
class matminer.featurizers.site.OPSiteFingerprint(target_motifs=None, dr=0.1, ddr=0.01, ndr=1, dop=0.001, dist_exp=2, zero_ops=True)

Bases: matminer.featurizers.base.BaseFeaturizer

Local structure order parameters computed from the neighbor environment of a site. For each order parameter, we determine the neighbor shell that complies with the expected coordination number. For example, we find the 4 nearest neighbors for the tetrahedral OP, the 6 nearest for the octahedral OP, and the 8 nearest neighbors for the bcc OP. If we don’t find such a shell, the OP is either set to zero or evaluated with the shell of the next largest observed coordination number. Args:

target_motifs (dict): target op or motif type where keys
are corresponding coordination numbers (e.g., {4: “tetrahedral”}).
dr (float): width for binning neighbors in unit of relative
distances (= distance/nearest neighbor distance). The binning is necessary to make the neighbor-finding step robust against small numerical variations in neighbor distances (default: 0.1).

ddr (float): variation of width for finding stable OP values.
ndr (int): number of width variations for each variation direction
(e.g., ndr = 0 only uses the input dr, whereas ndr = 1 tests dr = dr - ddr, dr, and dr + ddr).
dop (float): binning width to compute histogram for each OP
if ndr > 0.
dist_exp (int): exponent for distance factor to multiply
order parameters with that penalizes (large) variations in distances in a given motif. 0 will switch the option off (default: 2).
zero_ops (boolean): set an OP to zero if there is no neighbor
shell that complies with the expected coordination number of a given OP (e.g., CN=4 for tetrahedron; default: True).
__init__(target_motifs=None, dr=0.1, ddr=0.01, ndr=1, dop=0.001, dist_exp=2, zero_ops=True)
citations()
feature_labels()
featurize(struct, idx)

Get OP fingerprint of site with given index in input structure. Args:

struct (Structure): Pymatgen Structure object.
idx (int): index of target site in structure.
Returns:
opvals (numpy array): order parameters of target site.
implementors()
class matminer.featurizers.site.VoronoiFingerprint(cutoff=6.5, use_weights=False, stats_vol=None, stats_area=None, stats_dist=None)

Bases: matminer.featurizers.base.BaseFeaturizer

Calculate the following sets of features based on Voronoi tessellation analysis around the target site:
Voronoi indices
n_i denotes the number of i-edged facets, and i is in the range of 3-10. e.g. for bcc lattice, the Voronoi indices are [0,6,0,8,…]; for fcc/hcp lattice, the Voronoi indices are [0,12,0,0,…]; for icosahedra, the Voronoi indices are [0,0,12,0,…];
i-fold symmetry indices
computed as n_i/sum(n_i), and i is in the range of 3-10. These reflect the strength of i-fold symmetry in local sites. e.g. for bcc lattice, the i-fold symmetry indices are [0,6/14,0,8/14,…],
indicating both 4-fold and a stronger 6-fold symmetries are present;
for fcc/hcp lattice, the i-fold symmetry indices are [0,1,0,0,…],
indicating only 4-fold symmetry is present;
for icosahedra, the i-fold symmetry indices are [0,0,1,0,…],
indicating only 5-fold symmetry is present;
Weighted i-fold symmetry indices
if use_weights = True
Voronoi volume
total volume of the Voronoi polyhedron around the target site
Voronoi volume statistics of sub_polyhedra formed by each facet + center
e.g. stats_vol = [‘mean’, ‘std_dev’, ‘minimum’, ‘maximum’]
Voronoi area
total area of the Voronoi polyhedron around the target site
Voronoi area statistics of the facets
e.g. stats_area = [‘mean’, ‘std_dev’, ‘minimum’, ‘maximum’]
Voronoi nearest-neighboring distance statistics
e.g. stats_dist = [‘mean’, ‘std_dev’, ‘minimum’, ‘maximum’]
Args:
cutoff (float): cutoff distance in determining the potential
neighbors for Voronoi tessellation analysis. (default: 6.5)
use_weights (bool): whether to use weights to derive weighted
i-fold symmetry indices.
stats_vol (list of str): volume statistics types.
stats_area (list of str): area statistics types.
stats_dist (list of str): neighboring distance statistics types.
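The step from Voronoi indices to i-fold symmetry indices is a simple normalization, sketched here with the bcc and fcc examples from the description. The function name `i_fold_symmetry_indices` is hypothetical; the input is assumed to be the list of facet counts n_3 to n_10.

```python
def i_fold_symmetry_indices(voronoi_indices):
    """Normalize facet counts n_i (i = 3..10) to fractions n_i / sum(n_i)."""
    total = sum(voronoi_indices)
    if total == 0:
        return [0.0] * len(voronoi_indices)  # no facets found; avoid dividing by zero
    return [n / total for n in voronoi_indices]


bcc = i_fold_symmetry_indices([0, 6, 0, 8, 0, 0, 0, 0])   # 6 four-edged, 8 six-edged facets
fcc = i_fold_symmetry_indices([0, 12, 0, 0, 0, 0, 0, 0])  # 12 four-edged facets
```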

__init__(cutoff=6.5, use_weights=False, stats_vol=None, stats_area=None, stats_dist=None)
citations()
feature_labels()
featurize(struct, idx)

Get Voronoi fingerprints of site with given index in input structure.

Args:
struct (Structure): Pymatgen Structure object.
idx (int): index of target site in structure.
Returns:
(list of floats): Voronoi fingerprints.
- Voronoi indices
- i-fold symmetry indices
- weighted i-fold symmetry indices (if use_weights = True)
- Voronoi volume
- Voronoi volume statistics
- Voronoi area
- Voronoi area statistics
- Voronoi nearest-neighboring distance statistics
implementors()
static vol_tetra(vt1, vt2, vt3, vt4)

Calculate the volume of a tetrahedron, given the four vertices of vt1, vt2, vt3 and vt4.

Args:
vt1 (array-like): coordinates of vertex 1.
vt2 (array-like): coordinates of vertex 2.
vt3 (array-like): coordinates of vertex 3.
vt4 (array-like): coordinates of vertex 4.
Returns:
(float): volume of the tetrahedron.
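A tetrahedron's volume is one sixth of the absolute scalar triple product of three edge vectors sharing a vertex. The stand-alone sketch below uses that standard identity; it is named `tetrahedron_volume` to make clear it is an illustration and may differ from how vol_tetra is written internally.

```python
def tetrahedron_volume(vt1, vt2, vt3, vt4):
    """Volume = |a . (b x c)| / 6 with edges a, b, c taken from vt1."""
    a = [vt2[i] - vt1[i] for i in range(3)]
    b = [vt3[i] - vt1[i] for i in range(3)]
    c = [vt4[i] - vt1[i] for i in range(3)]
    triple = (
        a[0] * (b[1] * c[2] - b[2] * c[1])
        + a[1] * (b[2] * c[0] - b[0] * c[2])
        + a[2] * (b[0] * c[1] - b[1] * c[0])
    )
    return abs(triple) / 6.0
```

The Voronoi volume statistics are described as being computed over sub-polyhedra formed by each facet and the center, which is where a helper of this kind would be applied.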

matminer.featurizers.stats module

class matminer.featurizers.stats.PropertyStats

Bases: object

This class contains statistical operations that are commonly employed when computing features.

The primary way for interacting with this class is to call the calc_stat function, which takes the name of the statistic you would like to compute and the weights/values of data to be assessed. For example, computing the mean of a list looks like:

x = [1, 2, 3]
PropertyStats.calc_stat(x, 'mean') # Result is 2
PropertyStats.calc_stat(x, 'mean', weights=[0, 0, 1]) # Result is 3

Some of the statistics functions take options (e.g., Holder means). You can pass them to the statistics functions by adding them after the name and two colons. For example, the 0th Holder mean would be:

PropertyStats.calc_stat(x, 'holder_mean::0')

You can, of course, call the statistical functions directly. All take at least two arguments. The first is the data being assessed and the second, optional, argument is the weights.

static avg_dev(data_lst, weights=None)

Mean absolute deviation of list of element data.

This is computed by first calculating the mean of the list, and then computing the average absolute difference between each value and the mean.

Args:
data_lst (list of floats): List of values to be assessed
weights (list of floats): Weights for each value
Returns:
mean absolute deviation
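The two-step computation described above (mean first, then the average absolute difference from it) can be written out as below. The function name `mean_abs_deviation` and the way weights enter both steps are assumptions for illustration.

```python
def mean_abs_deviation(data_lst, weights=None):
    """Weighted mean of |x - mean|, where mean is the weighted mean of the data."""
    if weights is None:
        weights = [1.0] * len(data_lst)
    weight_sum = sum(weights)
    mean = sum(w * x for w, x in zip(weights, data_lst)) / weight_sum
    return sum(w * abs(x - mean) for w, x in zip(weights, data_lst)) / weight_sum
```

For x = [1, 2, 3] the mean is 2 and the absolute differences are 1, 0 and 1, so the unweighted result is 2/3.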
static calc_stat(data_lst, stat, weights=None)

Compute a property statistic

Args:

data_lst (list of floats): list of values
stat (str): name of the property to be computed. If there are arguments to the statistics function, these
should be added after the name and separated by two colons. For example, the 2nd Holder mean would be “holder_mean::2”
weights (list of floats): (Optional) weights for each element in data_lst

Returns:
float - Desired statistic
static eigenvalues(data_lst, symm=False, sort=False)

Return the eigenvalues of a matrix as a numpy array.

Args:
data_lst: (matrix-like) of values
symm: whether to assume the matrix is symmetric
sort: whether to sort the eigenvalues
Returns:
eigenvalues

static flatten(data_lst)

Returns a flattened copy of data_lst as a numpy array

static geom_std_dev(data_lst, weights=None)

Geometric standard deviation

Args:
data_lst (list of floats): List of values to be assessed
weights (list of floats): Weights for each value
Returns:
geometric standard deviation
static holder_mean(data_lst, weights=None, power=1)

Get Holder mean.

Args:
data_lst: (list/array) of values
weights: (list/array) of weights
power: (int/float/str) which holder mean to compute
Returns:
Holder mean
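For reference, the Holder (power) mean of order p is (sum(w * x**p) / sum(w))**(1/p), with the p = 0 case defined by its limit, the geometric mean. The sketch below implements that textbook definition for positive data; the function name `power_mean` is hypothetical and the handling of edge cases in the library may differ.

```python
import math


def power_mean(data_lst, weights=None, power=1):
    """Weighted Holder mean; power=0 is treated as the geometric-mean limit."""
    if weights is None:
        weights = [1.0] * len(data_lst)
    weight_sum = sum(weights)
    if power == 0:
        log_mean = sum(w * math.log(x) for w, x in zip(weights, data_lst)) / weight_sum
        return math.exp(log_mean)
    moment = sum(w * x ** power for w, x in zip(weights, data_lst)) / weight_sum
    return moment ** (1.0 / power)
```

Here power=1 gives the arithmetic mean, power=-1 the harmonic mean, and power=0 the geometric mean, so a single parameter spans several of the familiar averages.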

static inverse_mean(data_lst, weights=None)

Mean of the inverse of each entry

Args:
data_lst (list of floats): List of values to be assessed
weights (list of floats): Weights for each value
Returns:
inverse mean
static maximum(data_lst, weights=None)

Maximum value in a list

Args:
data_lst (list of floats): List of values to be assessed
weights: (ignored)
Returns:
maximum value
static mean(data_lst, weights=None)

Arithmetic mean of list

Args:
data_lst (list of floats): List of values to be assessed
weights (list of floats): Weights for each value
Returns:
mean value
static minimum(data_lst, weights=None)

Minimum value in a list

Args:
data_lst (list of floats): List of values to be assessed
weights: (ignored)
Returns:
minimum value
static mode(data_lst, weights=None)

Mode of a list of data.

If multiple elements occur equally-frequently (or same weight, if weights are provided), this function will return the minimum of those values.

Args:
data_lst (list of floats): List of values to be assessed
weights (list of floats): Weights for each value
Returns:
mode
static range(data_lst, weights=None)

Range of a list

Args:
data_lst (list of floats): List of values to be assessed
weights: (ignored)
Returns:
range
static sorted(data_lst)

Returns the sorted data_lst

static std_dev(data_lst, weights=None)

Standard deviation of a list of element data

Args:
data_lst (list of floats): List of values to be assessed
weights (list of floats): Weights for each value
Returns:
standard deviation

matminer.featurizers.structure module

class matminer.featurizers.structure.BagofBonds(nn, bbv=0, no_oxi=False, approx_bonds=False, token=' - ', allowed_bonds=None)

Bases: matminer.featurizers.base.BaseFeaturizer

Compute the number of each kind of bond in a structure, as a fraction of the total number of bonds, based on NearestNeighbors.

For example, in a structure with 2 Li-O bonds and 3 Li-P bonds:

Li-O: 0.4
Li-P: 0.6

Features:

BagofBonds must be fit with iterable of structures before featurization in order to define the allowed bond types (features). To do this, pass a list of allowed_bonds. Otherwise, fit based on a list of structures. If allowed_bonds is defined and BagofBonds is also fit, the intersection of the two lists of possible bonds is used.

For dataframes containing structures of various compositions, a unified dataframe is returned which has the collection of all possible bond types gathered from all structures as columns. To approximate bonds based on chemical rules (i.e., for a structure which you’d like to featurize but which has bonds not in the allowed set), use approx_bonds = True.

Args:
nn (NearestNeighbors): A Pymatgen nearest neighbors derived object. For
example, pymatgen.analysis.local_env.VoronoiNN().
bbv (float): The ‘bad bond values’, values substituted for
structure-bond combinations which cannot physically exist, but exist in the unified dataframe. For example, if a dataframe contains structures of BaLiP and BaTiO3, bbv determines the value to place in the Li-P column for the BaTiO3 row; by default, it is 0.
no_oxi (bool): If True, the featurizer will be agnostic to oxidation
states, which prevents oxidation states from differentiating bonds. For example, if True, Ca - O is identical to Ca2+ - O2-, Ca3+ - O-, etc., and all of them will be included in Ca - O column.
approx_bonds (bool): If True, approximates the fractions of bonds not
in allowed_bonds (forbidden bonds) with similar allowed bonds. Chemical rules are used to determine which bonds are most ‘similar’; particularly, the Euclidean distance between the 2-tuples of the bonds in Mendeleev no. space is minimized for the approximate bond chosen.
token (str): The string used to separate species in a bond, including
spaces. The token must contain at least one space and cannot have alphabetic characters in it, and should be padded by spaces. For example, for the bond Cs+ - Cl-, the token is ‘ - ’. This determines how bonds are represented in the dataframe.
allowed_bonds ([str]): A listlike object containing bond types as
strings. For example, Cs - Cl, or Li+ - O2-. Ions and elements will still have distinct bonds if (1) the bonds list originally contained them and (2) no_oxi is False. These must match the token specified.
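The bond-fraction idea from the Li-O / Li-P example can be reproduced with a few lines of plain Python. This sketch starts from an already-determined list of bonded element pairs, whereas the featurizer derives bonds from the nn object; the function name `bond_fractions` is hypothetical.

```python
from collections import Counter


def bond_fractions(bonds, token=" - "):
    """Count each bond type (elements sorted alphabetically) as a fraction of all bonds."""
    labels = [token.join(sorted(pair)) for pair in bonds]
    counts = Counter(labels)
    total = len(labels)
    return {label: n / total for label, n in sorted(counts.items())}


# 2 Li-O bonds and 3 Li-P bonds, as in the example above:
fractions = bond_fractions([("Li", "O")] * 2 + [("P", "Li")] * 3)
```

Sorting the two species before joining them with the token ensures that ("P", "Li") and ("Li", "P") are counted as the same bond type.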
__init__(nn, bbv=0, no_oxi=False, approx_bonds=False, token=' - ', allowed_bonds=None)
citations()
enumerate_all_bonds(structures)

Identify all the unique, possible bonds types of all structures present, and create the ‘unified’ bonds list.

Args:
structures (list/ndarray): List of pymatgen Structures
Returns:
A tuple of unique, possible bond types for an entire list of structures. This tuple is used to form the unified feature labels.
enumerate_bonds(s)

Lists out all the bond possibilities in a single structure.

Args:
s (Structure): A pymatgen structure
Returns:
A list of bond types in ‘Li-O’ form, where the order of the elements in each bond type is alphabetic.
feature_labels()

Returns the list of allowed bonds. Throws an error if the featurizer has not been fit.

featurize(s)

Quantify the fractions of each bond type in a structure.

For collections of structures, bond types which are not found in a particular structure (e.g., Li-P in BaTiO3) are represented by the bad bond value bbv (0 by default).

Args:
s (Structure): A pymatgen Structure object
Returns:
(list) The feature list of bond fractions, in the order of the
alphabetized corresponding bond names.
fit(X, y=None)

Define the bond types allowed to be returned during each featurization. Bonds found during featurization which are not allowed will be omitted from the returned dataframe or matrix.

Fit BagofBonds by either passing an iterable of structures to X or by defining the bonds explicitly with allowed_bonds in __init__.

Args:
X (Series/list): An iterable of pymatgen Structure
objects which will be used to determine the allowed bond types.

y : unused (added for consistency with overridden method signature)

Returns:
self
static from_preset(preset, **kwargs)

Use one of the standard instances of a given NearNeighbor class. Pass args to __init__, such as allowed_bonds, using this method as well.

Args:
preset (str): preset type (“VoronoiNN”, “JMolNN”, “MinimumDistanceNN”, “MinimumOKeeffeNN”, or “MinimumVIRENN”).
Returns:
CoordinationNumber from a preset.
implementors()
class matminer.featurizers.structure.CoulombMatrix(diag_elems=True)

Bases: matminer.featurizers.base.BaseFeaturizer

Generate the Coulomb matrix, M, of the input structure (or molecule). The Coulomb matrix was put forward by Rupp et al. (Phys. Rev. Lett. 108, 058301, 2012) and is defined by off-diagonal elements M_ij = Z_i*Z_j/|R_i-R_j| and diagonal elements 0.5*Z_i^2.4, where Z_i and R_i denote the nuclear charge and the position of atom i, respectively.

Args:
diag_elems: (bool) flag indicating whether (True, default) to use
the original definition of the diagonal elements; if set to False, the diagonal elements are set to zero.
__init__(diag_elems=True)
citations()
feature_labels()
featurize(s)

Get Coulomb matrix of input structure.

Args:
s: input Structure (or Molecule) object.
Returns:
m: (Nsites x Nsites matrix) Coulomb matrix.
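The definition above translates directly into a few lines of code. The following is a minimal sketch for a finite set of atoms given as nuclear charges and Cartesian positions; it makes no statement about units, and `coulomb_matrix` is a hypothetical stand-alone function, not the featurizer itself.

```python
import math


def coulomb_matrix(charges, positions, diag_elems=True):
    """M_ij = Z_i*Z_j/|R_i-R_j| off the diagonal, 0.5*Z_i^2.4 on it."""
    n = len(charges)
    m = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == j:
                m[i][j] = 0.5 * charges[i] ** 2.4 if diag_elems else 0.0
            else:
                r_ij = math.dist(positions[i], positions[j])
                m[i][j] = charges[i] * charges[j] / r_ij
    return m


# Two atoms with Z = 1 and Z = 8, separated by 2 distance units
m = coulomb_matrix([1, 8], [(0.0, 0.0, 0.0), (0.0, 0.0, 2.0)])
print(m)
```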
implementors()
class matminer.featurizers.structure.DensityFeatures(desired_features=None)

Bases: matminer.featurizers.base.BaseFeaturizer

Calculates density and density-like features: density, volume per atom (“vpa”), and packing fraction

__init__(desired_features=None)
Args:
desired_features ([str]): features to compute; choose from “density”, “vpa”, “packing fraction”
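To make the three quantities concrete, here is a sketch that computes them from per-atom masses, radii, and a cell volume. The function `density_features`, the choice of hard-sphere radii for the packing fraction, and the sample numbers are all assumptions made for illustration.

```python
import math

AMU_TO_G = 1.66053906660e-24   # grams per atomic mass unit
A3_TO_CM3 = 1e-24              # cubic centimetres per cubic angstrom


def density_features(masses_amu, radii_A, volume_A3):
    """Density (g/cm^3), volume per atom (A^3), and packing fraction."""
    n_atoms = len(masses_amu)
    density = sum(masses_amu) * AMU_TO_G / (volume_A3 * A3_TO_CM3)
    vpa = volume_A3 / n_atoms
    sphere_volume = sum(4.0 / 3.0 * math.pi * r ** 3 for r in radii_A)
    return {"density": density, "vpa": vpa,
            "packing fraction": sphere_volume / volume_A3}


# Hypothetical simple-cubic cell: one atom of radius 1 A in a 2 x 2 x 2 A cell
features = density_features([10.0], [1.0], 8.0)
print(features)
```

For touching spheres on a simple cubic lattice the packing fraction is pi/6, which gives a quick sanity check on the formula.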
citations()
feature_labels()
featurize(s)
implementors()
class matminer.featurizers.structure.ElectronicRadialDistributionFunction(cutoff=None, dr=0.05)

Bases: matminer.featurizers.base.BaseFeaturizer

Calculate the crystal structure-inherent electronic radial distribution function (ReDF) according to Willighagen et al., Acta Cryst., 2005, B61, 29-36. The ReDF is a structure-integral RDF (i.e., summed over all sites) in which the positions of neighboring sites are weighted by electrostatic interactions inferred from atomic partial charges. Atomic charges are obtained from the ValenceIonicRadiusEvaluator class.

Args:
cutoff: (float) distance up to which the ReDF is to be
calculated (default: longest diagonal in primitive cell).

dr: (float) width of bins (“x”-axis) of ReDF (default: 0.05 Å).

__init__(cutoff=None, dr=0.05)
citations()
feature_labels()
featurize(s)

Get ReDF of input structure.

Args:
s: input Structure object.
Returns: (dict) a copy of the electronic radial distribution
functions (ReDF) as a dictionary. The distance list (“x”-axis values of ReDF) can be accessed via key ‘distances’; the ReDF itself is accessible via key ‘redf’.
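A charge-weighted RDF of this kind might be accumulated as in the sketch below, which works on a finite set of point charges instead of a periodic structure. The weighting q_i*q_j / (N*r_ij) is an assumption about the normalization, and `toy_redf` is a hypothetical name; only the output keys ‘distances’ and ‘redf’ are taken from the description above.

```python
import math


def toy_redf(charges, positions, cutoff, dr=0.05):
    """Charge-weighted pair histogram for a finite cluster (no periodicity)."""
    n_bins = int(math.ceil(cutoff / dr))
    redf = [0.0] * n_bins
    n = len(charges)
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            d = math.dist(positions[i], positions[j])
            if d < cutoff:
                k = min(int(d / dr), n_bins - 1)
                # Assumed weighting: product of charges over (N * distance)
                redf[k] += charges[i] * charges[j] / (n * d)
    distances = [k * dr for k in range(n_bins)]
    return {"distances": distances, "redf": redf}


result = toy_redf([1.0, -1.0], [(0.0, 0.0, 0.0), (2.0, 0.0, 0.0)],
                  cutoff=3.0, dr=0.5)
print(result)
```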
implementors()
class matminer.featurizers.structure.EwaldEnergy(accuracy=None)

Bases: matminer.featurizers.base.BaseFeaturizer

Compute the energy from Coulombic interactions

Note: The energy is computed using _charges already defined for the structure_.

Features:
ewald_energy - Coulomb interaction energy of the structure
__init__(accuracy=None)
Args:
accuracy (int): Accuracy of Ewald summation, number of decimal places
citations()
feature_labels()
featurize(strc)
Args:
strc (Structure): Structure being analyzed
Returns:
([float]) - Electrostatic energy of the structure
implementors()
class matminer.featurizers.structure.GlobalSymmetryFeatures(desired_features=None)

Bases: matminer.featurizers.base.BaseFeaturizer

Determines symmetry features: spacegroup number, crystal system (1 of 7), and whether the material is centrosymmetric (has inversion symmetry)

__init__(desired_features=None)
citations()
crystal_idx = {'triclinic': 7, 'monoclinic': 6, 'orthorhombic': 5, 'tetragonal': 4, 'trigonal': 3, 'hexagonal': 2, 'cubic': 1}
feature_labels()
featurize(s)
implementors()
class matminer.featurizers.structure.MinimumRelativeDistances(cutoff=10.0)

Bases: matminer.featurizers.base.BaseFeaturizer

Determines the relative distance of each site to its closest neighbor. We use the relative distance, f_ij = r_ij / (r^atom_i + r^atom_j), as a measure rather than the absolute distances, r_ij, to account for the fact that different atoms/species have different sizes. The function uses the valence-ionic radius estimator implemented in Pymatgen.

Args:
cutoff: (float) (absolute) distance up to which tentative
closest neighbors (on the basis of relative distances) are to be determined.
__init__(cutoff=10.0)
citations()
feature_labels()
featurize(s, cutoff=10.0)

Get minimum relative distances of all sites of the input structure.

Args:
s: Pymatgen Structure object.
Returns:
min_rel_dists: (list of floats) list of all minimum relative
distances (i.e., for all sites).
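The relative-distance measure f_ij = r_ij / (r^atom_i + r^atom_j) can be sketched for a finite set of sites as follows. The radii here are supplied by hand (hypothetical values) rather than estimated by Pymatgen, and periodic images are ignored.

```python
import math


def min_relative_distances(positions, radii, cutoff=10.0):
    """Smallest r_ij / (r_i + r_j) for each site, within an absolute cutoff."""
    result = []
    for i, (p_i, r_i) in enumerate(zip(positions, radii)):
        best = math.inf
        for j, (p_j, r_j) in enumerate(zip(positions, radii)):
            if i == j:
                continue
            r_ij = math.dist(p_i, p_j)
            if r_ij <= cutoff:
                best = min(best, r_ij / (r_i + r_j))
        result.append(best)
    return result


positions = [(0.0, 0.0, 0.0), (2.0, 0.0, 0.0), (0.0, 3.0, 0.0)]
radii = [0.5, 1.5, 1.0]  # hypothetical atomic radii
min_rel_dists = min_relative_distances(positions, radii)
print(min_rel_dists)
```

For the third site, the neighbour that is nearest in absolute terms (the first site, at 3.0) is not the nearest in relative terms (the second site, at about 1.44), which is the situation the relative measure is meant to capture.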
implementors()
class matminer.featurizers.structure.OrbitalFieldMatrix(period_tag=False)

Bases: matminer.featurizers.base.BaseFeaturizer

This function generates an orbital field matrix (OFM) as developed by Pham et al. (arXiv, May 2017). Each atom is described by a 32-element vector (or 39-element vector, see period_tag for details) uniquely representing the valence subshell. A 32x32 (39x39) matrix is formed by multiplying two atomic vectors. An OFM for an atomic environment is the sum of these matrices over every atom that the center atom coordinates with, each multiplied by a distance function (in this case, 1/r times the weight of the coordinating atom in the Voronoi polyhedra method). The OFM of a structure or molecule is the average of the OFMs for all the sites in the structure.

Args:
period_tag (bool): In the original OFM, an element is represented
by a vector of length 32, where each element is 1 or 0, which represents the valence subshell of the element. With period_tag=True, the vector size is increased to 39, where the 7 extra elements represent the period of the element. Note lanthanides are treated as period 6, actinides as period 7. Default False as in the original paper.
Attributes:
size (int): Either 32 or 39, the size of the vectors used to describe elements.
__init__(period_tag=False)
citations()
feature_labels()
featurize(s)

Makes a supercell for structure s (to protect sites from coordinating with themselves), and then finds the mean of the orbital field matrices of each site to characterize a structure

Args:
s (Structure): structure to characterize
Returns:
mean_ofm (size X size matrix): orbital field matrix
characterizing s
get_atom_ofms(struct, symm=False)

Calls get_single_ofm for every site in struct. If symm=True, get_single_ofm is called for symmetrically distinct sites, and counts is constructed such that ofms[i] occurs counts[i] times in the structure

Args:
struct (Structure): structure to find ofms for
symm (bool): whether to calculate ofm for only symmetrically
distinct sites
Returns:
ofms ([size X size matrix] X len(struct)): ofms for struct
if symm:
ofms ([size X size matrix] X number of symmetrically distinct sites):
ofms for struct
counts: number of identical sites for each ofm

get_mean_ofm(ofms, counts)

Averages a list of ofms, weighted by counts.
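A count-weighted average of matrices can be sketched with nested lists as below. `mean_ofm` is a hypothetical stand-alone function and the 2 x 2 matrices are toy inputs, far smaller than a real 32 x 32 OFM.

```python
def mean_ofm(ofms, counts):
    """Element-wise average of square matrices, weighting ofms[i] by counts[i]."""
    total = sum(counts)
    size = len(ofms[0])
    return [[sum(c * m[r][k] for m, c in zip(ofms, counts)) / total
             for k in range(size)]
            for r in range(size)]


a = [[1.0, 0.0], [0.0, 1.0]]
b = [[0.0, 4.0], [4.0, 0.0]]
avg = mean_ofm([a, b], [3, 1])  # matrix a occurs three times, b once
print(avg)
```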

get_ohv(sp, period_tag)

Get the “one-hot-vector” for pymatgen Element sp. This 32 or 39-length vector represents the valence shell of the given element.

Args:
sp (Element): element whose ohv should be returned
period_tag (bool): If true, the vector contains items
corresponding to the period of the element
Returns:
my_ohv (numpy array length 39 if period_tag, else 32): ohv for sp
get_single_ofm(site, site_dict)

Gets the orbital field matrix for a single chemical environment, where site is the center atom whose environment is characterized and site_dict is a dictionary of site : weight, where the weights are the Voronoi Polyhedra weights of the corresponding coordinating sites.

Args:
site (Site): center atom
site_dict (dict of Site:float): chemical environment
Returns:
atom_ofm (size X size numpy matrix): ofm for site
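The construction of a single-environment OFM can be sketched as a weighted sum of outer products. The version below takes the neighbours as plain (vector, weight, distance) tuples instead of a site_dict, uses toy vectors of length 3 rather than 32, and assumes the 1/r distance function mentioned in the class description; `single_ofm` is a hypothetical name.

```python
def outer(u, v):
    """Outer product of two vectors as a nested list."""
    return [[a * b for b in v] for a in u]


def single_ofm(center_vec, neighbours):
    """Sum of outer(center, neighbour) * weight / distance over neighbours.

    neighbours: list of (one_hot_vector, voronoi_weight, distance) tuples.
    """
    size = len(center_vec)
    ofm = [[0.0] * size for _ in range(size)]
    for vec, weight, dist in neighbours:
        contribution = outer(center_vec, vec)
        for r in range(size):
            for c in range(size):
                ofm[r][c] += contribution[r][c] * weight / dist
    return ofm


center = [1, 0, 1]  # toy one-hot-style vector, not a real 32-element one
neighbours = [([0, 1, 0], 0.5, 2.0), ([1, 0, 0], 1.0, 4.0)]
atom_ofm = single_ofm(center, neighbours)
print(atom_ofm)
```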
get_structure_ofm(struct)

Calls get_mean_ofm on the results of get_atom_ofms to give a size X size matrix characterizing a structure

implementors()
class matminer.featurizers.structure.PartialRadialDistributionFunction(cutoff=20.0, bin_size=0.1, include_elems=(), exclude_elems=())

Bases: matminer.featurizers.base.BaseFeaturizer

Compute the partial radial distribution function (PRDF) of a crystal structure, which is the radial distribution function broken down for each pair of atom types. The PRDF was proposed as a structural descriptor by Schutt et al. (https://journals.aps.org/prb/abstract/10.1103/PhysRevB.89.205118).

Args:
cutoff: (float) distance up to which to calculate the RDF.
bin_size: (float) size of each bin of the (discrete) RDF.
include_elems: (list of string), list of elements that must be included in PRDF
exclude_elems: (list of string), list of elements that should not be included in PRDF
Features:
Each feature corresponds to the density of number of bonds
for a certain pair of elements at a certain range of distances. For example, “Al-Al PRDF r=1.00-1.50” corresponds to the density of Al-Al bonds between 1 and 1.5 distance units. By default, this featurizer generates RDFs for each pair of elements in the training set.
__init__(cutoff=20.0, bin_size=0.1, include_elems=(), exclude_elems=())
citations()
compute_prdf(s)

Compute the PRDF for a structure

Args:
s: (Structure), structure to be evaluated
Returns:

dist_bins - float, start of each of the bins
prdf - dict, where each key is a pair of elements (strings),
and the value is the radial distribution function for that pair of elements
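The idea of splitting an RDF by element pair can be sketched for a small, non-periodic set of atoms. In the code below, `toy_prdf` is a hypothetical function, the normalization by shell volume alone is an assumption, and only the label format is taken from the Features description above.

```python
import math
from collections import defaultdict


def toy_prdf(species, positions, cutoff, bin_size):
    """Pair-resolved distance histogram, divided by each bin's shell volume."""
    n_bins = int(math.ceil(cutoff / bin_size))
    dist_bins = [k * bin_size for k in range(n_bins)]
    counts = defaultdict(lambda: [0] * n_bins)
    for i in range(len(species)):
        for j in range(i + 1, len(species)):
            d = math.dist(positions[i], positions[j])
            if d < cutoff:
                pair = tuple(sorted((species[i], species[j])))
                counts[pair][min(int(d / bin_size), n_bins - 1)] += 1
    prdf = {}
    for pair, hist in counts.items():
        prdf[pair] = [h / (4.0 / 3.0 * math.pi * ((r + bin_size) ** 3 - r ** 3))
                      for h, r in zip(hist, dist_bins)]
    return dist_bins, prdf


species = ["Al", "Al", "Ni"]
positions = [(0.0, 0.0, 0.0), (1.2, 0.0, 0.0), (0.0, 2.6, 0.0)]
dist_bins, prdf = toy_prdf(species, positions, cutoff=3.0, bin_size=0.5)

# Label for the bin holding the Al-Al pair, in the format quoted above
label = "{}-{} PRDF r={:.2f}-{:.2f}".format("Al", "Al", dist_bins[2], dist_bins[2] + 0.5)
print(label, prdf[("Al", "Al")][2])
```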
feature_labels()
featurize(s)

Get PRDF of the input structure.

Args:
s: Pymatgen Structure object.
Returns:
prdf, dist: (tuple of arrays) the first element is a
dictionary where keys are tuples of element names and values are PRDFs.
fit(X, y=None)

Define the list of elements to be included in the PRDF. By default, the PRDF will include all of the elements in X

Args:
X: (numpy array nx1) structures used in the training set. Each entry
must be Pymatgen Structure objects.

y: Not used
fit_kwargs: not used

Returns:
self
implementors()
class matminer.featurizers.structure.RadialDistributionFunction(cutoff=20.0, bin_size=0.1)

Bases: matminer.featurizers.base.BaseFeaturizer

Calculate the radial distribution function (RDF) of a crystal structure.

Args:
cutoff: (float) distance up to which to calculate the RDF.
bin_size: (float) size of each bin of the (discrete) RDF.
__init__(cutoff=20.0, bin_size=0.1)
citations()
feature_labels()
featurize(s)

Get RDF of the input structure.

Args:
s: Pymatgen Structure object.
Returns:
rdf, dist: (tuple of arrays) the first element is the
normalized RDF, whereas the second element is the inner radius of the RDF bin.
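One common way to normalize an RDF is to divide the neighbour count in each shell by the count expected for a uniform density. The sketch below does this for a finite set of points with a user-supplied volume; it ignores periodic images, and both `toy_rdf` and the exact normalization are assumptions rather than a description of the featurizer's internals.

```python
import math


def toy_rdf(positions, volume, cutoff=20.0, bin_size=0.1):
    """Return (rdf, dist): normalized RDF and the inner radius of each bin."""
    n = len(positions)
    n_bins = int(math.ceil(cutoff / bin_size))
    dist = [k * bin_size for k in range(n_bins)]
    counts = [0] * n_bins
    for i in range(n):
        for j in range(n):
            if i != j:
                d = math.dist(positions[i], positions[j])
                if d < cutoff:
                    counts[min(int(d / bin_size), n_bins - 1)] += 1
    density = n / volume
    rdf = []
    for r, c in zip(dist, counts):
        shell = 4.0 / 3.0 * math.pi * ((r + bin_size) ** 3 - r ** 3)
        # Neighbours per site, relative to a uniform gas of the same density
        rdf.append(c / (n * density * shell))
    return rdf, dist


rdf, dist = toy_rdf([(0, 0, 0), (1, 0, 0)], volume=8.0, cutoff=2.0, bin_size=0.5)
print(dist)
print(rdf)
```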
implementors()
class matminer.featurizers.structure.RadialDistributionFunctionPeaks(n_peaks=2)

Bases: matminer.featurizers.base.BaseFeaturizer

Determine the location of the highest peaks in the radial distribution function (RDF) of a structure.

Args:
n_peaks: (int) number of the top peaks to return.
__init__(n_peaks=2)
citations()
feature_labels()
featurize(rdf)

Get location of highest peaks in RDF.

Args:
rdf: (ndarray) RDF as obtained from the
RadialDistributionFunction class.
Returns: (ndarray) distances of highest peaks in descending order
of the peak height
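A simple way to pick out peaks is to look for bins that are higher than both neighbours and then rank them by height. The sketch below follows that approach; treating a "peak" as a strict local maximum is my assumption, and the featurizer's own criterion may differ. `highest_rdf_peaks` and its separate `distances` argument are hypothetical.

```python
def highest_rdf_peaks(rdf, distances, n_peaks=2):
    """Distances of the n_peaks tallest local maxima, tallest first."""
    peaks = [k for k in range(1, len(rdf) - 1)
             if rdf[k] > rdf[k - 1] and rdf[k] > rdf[k + 1]]
    peaks.sort(key=lambda k: rdf[k], reverse=True)
    return [distances[k] for k in peaks[:n_peaks]]


rdf_values = [0.0, 1.0, 3.0, 1.0, 0.0, 5.0, 2.0, 2.5, 0.0]
bin_starts = [0.5 * k for k in range(len(rdf_values))]
top = highest_rdf_peaks(rdf_values, bin_starts, n_peaks=2)
print(top)
```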
implementors()
class matminer.featurizers.structure.SineCoulombMatrix(diag_elems=True)

Bases: matminer.featurizers.base.BaseFeaturizer

This function generates a variant of the Coulomb matrix developed for periodic crystals by Faber et al. (Inter. J. Quantum Chem. 115, 16, 2015). It is identical to the Coulomb matrix, except that the inverse distance function is replaced by the inverse of a sin**2 function of the vector between the sites which is periodic in the dimensions of the structure lattice. See paper for details.

Args:
diag_elems (bool): flag indicating whether (True, default) to use
the original definition of the diagonal elements; if set to False, the diagonal elements are set to 0
__init__(diag_elems=True)
citations()
feature_labels()
featurize(s)
Args:
s (Structure or Molecule): input structure (or molecule)
Returns:
(Nsites x Nsites matrix) Sine matrix.
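The periodic replacement for the inverse distance can be sketched as follows. This is my reading of the sine matrix in Faber et al.: take sin^2 of pi times each fractional-coordinate difference, map that vector to Cartesian space with the lattice vectors, and use the inverse of its norm. Treat it as an assumption to check against the paper; `sine_coulomb_matrix` is a hypothetical stand-alone function.

```python
import math


def sine_coulomb_matrix(charges, frac_coords, lattice, diag_elems=True):
    """lattice: three row vectors (a, b, c) in Cartesian coordinates."""
    n = len(charges)
    m = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == j:
                m[i][j] = 0.5 * charges[i] ** 2.4 if diag_elems else 0.0
                continue
            # sin^2 of each fractional difference has period 1 (one cell)
            s = [math.sin(math.pi * (fi - fj)) ** 2
                 for fi, fj in zip(frac_coords[i], frac_coords[j])]
            # Map the periodic vector to Cartesian space via the lattice vectors
            vec = [sum(s[k] * lattice[k][x] for k in range(3)) for x in range(3)]
            norm = math.sqrt(sum(v * v for v in vec))
            m[i][j] = charges[i] * charges[j] / norm
    return m


cubic = [[2.0, 0.0, 0.0], [0.0, 2.0, 0.0], [0.0, 0.0, 2.0]]
sm = sine_coulomb_matrix([1, 1], [(0.0, 0.0, 0.0), (0.5, 0.0, 0.0)], cubic)
# Shifting one site by a full lattice vector should leave the matrix unchanged
sm_shifted = sine_coulomb_matrix([1, 1], [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0)], cubic)
print(sm[0][1], sm_shifted[0][1])
```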
implementors()
class matminer.featurizers.structure.SiteStatsFingerprint(site_featurizer, stats=('mean', 'std_dev', 'minimum', 'maximum'), min_oxi=None, max_oxi=None)

Bases: matminer.featurizers.base.BaseFeaturizer

Calculates all order parameters (OPs) for all sites in a crystal structure.

Args:
site_featurizer (BaseFeaturizer): a site-based featurizer
stats ([str]): list of weighted statistics to compute for each feature.
If stats is None, for each order parameter, a list is returned that contains the calculated parameter for each site in the structure. Note: for the nth mode, stat must be ‘n*_mode’; e.g., stat=‘2nd_mode’.
min_oxi (int): minimum site oxidation state for inclusion (e.g.,
zero means metals/cations only)
max_oxi (int): maximum site oxidation state for inclusion
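Reducing per-site features to structure-level statistics can be sketched as below. The helper `site_stats`, the use of the population standard deviation for ‘std_dev’, and the unweighted treatment of sites are assumptions made for illustration; the stat names are the defaults from the signature.

```python
import statistics


def site_stats(site_features, stats=("mean", "std_dev", "minimum", "maximum")):
    """Apply each statistic to each feature column across all sites."""
    funcs = {
        "mean": statistics.mean,
        "std_dev": statistics.pstdev,  # population std, an assumption
        "minimum": min,
        "maximum": max,
    }
    columns = list(zip(*site_features))  # one tuple of values per feature
    return [funcs[s](col) for col in columns for s in stats]


# Two sites, two features per site (toy values)
fingerprint = site_stats([[1.0, 4.0], [3.0, 4.0]])
print(fingerprint)
```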

__init__(site_featurizer, stats=('mean', 'std_dev', 'minimum', 'maximum'), min_oxi=None, max_oxi=None)
citations()
feature_labels()
featurize(s)

Calculate all sites’ local structure order parameters (LSOPs).

Args:

s: Pymatgen Structure object.

Returns:
vals: (2D array of floats) LSOP values of all sites’ (1st dimension) order parameters (2nd dimension). 46 order parameters are computed per site: q_cn (coordination number), q_lin, 35 x q_bent (starting with a target angle of 5 degrees and, increasing by 5 degrees, until 175 degrees), q_tet, q_oct, q_bcc, q_2, q_4, q_6, q_reg_tri, q_sq, q_sq_pyr.
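The count of 46 order parameters per site can be checked by listing them. The label strings for the bent order parameters (`q_bent_5`, and so on) are hypothetical names used only for this tally.

```python
# Target angles for q_bent: 5, 10, ..., 175 degrees
bent_angles = list(range(5, 180, 5))

op_labels = (
    ["q_cn", "q_lin"]
    + ["q_bent_{}".format(angle) for angle in bent_angles]  # hypothetical labels
    + ["q_tet", "q_oct", "q_bcc", "q_2", "q_4", "q_6",
       "q_reg_tri", "q_sq", "q_sq_pyr"]
)

print(len(bent_angles), len(op_labels))
```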
static from_preset(preset, **kwargs)
implementors()
static n_numerical_modes(data_lst, n=2, dl=0.1)
Returns the n first modes of a data set that are obtained with
a finite bin size for the underlying frequency distribution.
Args:
data_lst ([float]): data values.
n (integer): number of most frequent elements to be determined.
dl (float): bin size of underlying (coarsened) distribution.
Returns:
([float]): first n most frequent entries (or nan if not found).
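Finding modes on a coarsened distribution can be sketched by binning the data with width dl and ranking the bins by population. Whether a mode is reported as the bin centre (as here), the bin edge, or an actual data value is an assumption, as is the tie-breaking order; `n_numerical_modes_sketch` is a hypothetical name.

```python
import math
from collections import Counter


def n_numerical_modes_sketch(data_lst, n=2, dl=0.1):
    """Centres of the n most populated bins of width dl, padded with NaN."""
    bins = Counter(math.floor(x / dl) for x in data_lst)
    centres = [(k + 0.5) * dl for k, _ in bins.most_common(n)]
    return centres + [math.nan] * (n - len(centres))


data = [0.12, 0.14, 0.18, 0.51, 0.55, 0.93]
modes = n_numerical_modes_sketch(data, n=2, dl=0.1)
padded = n_numerical_modes_sketch(data, n=4, dl=0.1)
print(modes)
print(padded)
```

When fewer than n bins are populated, the remaining entries are NaN, matching the "nan if not found" note above.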
matminer.featurizers.structure.get_op_stats_vector_diff(s1, s2, max_dr=0.2, ddr=0.01, ddist=0.01)

Determine the difference vector between two order parameter-statistics feature vectors resulting from two input structures.

Args:
s1 (Structure): first input structure.
s2 (Structure): second input structure.
max_dr (float): maximum neighbor-finding parameter to be tested.
ddr (float): step size for increasing neighbor-finding parameter.
ddist (float): bin size for histogramming distances of varying dr.
Returns: (float, [float]) optimal neighbor-finding parameter
and difference vector between order parameter-statistics feature vectors obtained from the two input structures (s1 - s2).

Module contents