matminer.data_retrieval package¶
Subpackages¶
- matminer.data_retrieval.tests package
- Submodules
- matminer.data_retrieval.tests.test_retrieve_AFLOW module
- matminer.data_retrieval.tests.test_retrieve_Citrine module
- matminer.data_retrieval.tests.test_retrieve_MDF module
- matminer.data_retrieval.tests.test_retrieve_MP module
- matminer.data_retrieval.tests.test_retrieve_MPDS module
- matminer.data_retrieval.tests.test_retrieve_MongoDB module
- Module contents
Submodules¶
matminer.data_retrieval.retrieve_AFLOW module¶
matminer.data_retrieval.retrieve_Citrine module¶
matminer.data_retrieval.retrieve_MDF module¶
-
class
matminer.data_retrieval.retrieve_MDF.
MDFDataRetrieval
(anonymous=False, **kwargs)¶ Bases:
matminer.data_retrieval.retrieve_base.BaseDataRetrieval
MDFDataRetrieval is used to retrieve data from the Materials Data Facility database and convert them into a Pandas DataFrame. Note that invocation with full access to MDF will require authentication (see api_link) but an anonymous mode is supported, which can be used with anonymous=True as a keyword arg.
- Examples:
>>> mdf_dr = MDFDataRetrieval(anonymous=True)
>>> results = mdf_dr.get_dataframe({"elements": ["Ag", "Be"], "source_names": ["oqmd"]})
>>> results = mdf_dr.get_dataframe({"source_names": ["oqmd"],
...                                 "match_ranges": {"oqmd.band_gap.value": [4.0, "*"]}})
If you use this data retrieval class, please additionally cite: Blaiszik, B., Chard, K., Pruyne, J., Ananthakrishnan, R., Tuecke, S., Foster, I., 2016. The Materials Data Facility: Data Services to Advance Materials Science Research. JOM 68, 2045–2052. https://doi.org/10.1007/s11837-016-2001-3
-
__init__
(anonymous=False, **kwargs)¶ - Args:
- anonymous (bool): whether to use anonymous login (i.e. no
Globus authentication)
- **kwargs: kwargs for Forge, including index (Globus search index
to search on), local_ep, anonymous
-
api_link
()¶ The link to comprehensive API documentation or data source.
- Returns:
(str): A link to the API documentation for this DataRetrieval class.
-
get_data
(squery, unwind_arrays=True, **kwargs)¶ Gets a dataframe from the MDF API from an explicit string query (rather than input args like get_dataframe).
- Args:
squery (str): string for the explicit query
unwind_arrays (bool): whether or not to unwind arrays when flattening docs for the dataframe
**kwargs: kwargs for the query
- Returns:
dataframe corresponding to query
-
get_dataframe
(criteria, properties=None, unwind_arrays=True)¶ Retrieves data from the MDF API and formats it as a pandas DataFrame.
- Args:
- criteria (dict): options for keys are
source_names ([str]): source names to include, e.g. ["oqmd"]
elements ([str]): elements to include, e.g. ["Ag", "Si"]
titles ([str]): titles to include, e.g. ["Coarsening of a semisolid Al-Cu alloy"]
tags ([str]): tags to include, e.g. ["outcar"]
resource_types ([str]): resources to include, e.g. ["record"]
match_fields ({}): field-value mappings to include, e.g. {"oqmd.converged": True}
exclude_fields ({}): field-value mappings to exclude, e.g. {"oqmd.converged": False}
match_ranges ({}): field-range mappings to include, e.g. {"oqmd.band_gap.value": [1, 5]}; use "*" for no lower or upper bound, e.g. {"oqmd.band_gap.value": [1, "*"]}
exclude_ranges ({}): field-range mappings to exclude, e.g. {"oqmd.band_gap.value": [3, "*"]} to exclude all results with a band gap higher than 3
- raw (bool): whether or not to return raw (non-dataframe) output, defaults to False
- unwind_arrays (bool): whether or not to unwind arrays when flattening docs for the dataframe
- Returns (pandas.DataFrame):
DataFrame corresponding to all documents from aggregated query
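As a rough illustration of the criteria keys listed above, the sketch below assembles a criteria dictionary and checks that every key is one of the documented options. The names DOCUMENTED_KEYS and check_criteria_keys are hypothetical helpers written for this example and are not part of matminer; the commented-out call only indicates where the dictionary would be passed.

```python
# Criteria keys documented for MDFDataRetrieval.get_dataframe.
DOCUMENTED_KEYS = {
    "source_names", "elements", "titles", "tags", "resource_types",
    "match_fields", "exclude_fields", "match_ranges", "exclude_ranges",
}


def check_criteria_keys(criteria):
    """Hypothetical helper: return the keys that are not documented options."""
    return sorted(set(criteria) - DOCUMENTED_KEYS)


criteria = {
    "source_names": ["oqmd"],
    "elements": ["Ag", "Be"],
    "match_fields": {"oqmd.converged": True},
    # "*" leaves the upper bound open, as described in the Args above.
    "match_ranges": {"oqmd.band_gap.value": [1, "*"]},
}
unknown = check_criteria_keys(criteria)

# With matminer installed and network access, the query would then be:
# from matminer.data_retrieval.retrieve_MDF import MDFDataRetrieval
# df = MDFDataRetrieval(anonymous=True).get_dataframe(criteria)
```

Checking the keys before sending the query makes a misspelled key such as "element" visible early instead of relying on the remote service to report it.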
-
matminer.data_retrieval.retrieve_MDF.
make_dataframe
(docs, unwind_arrays=True)¶ Formats raw docs returned from MDF API search into a dataframe
- Args:
- docs ([dict]): list of documents from a Forge search or aggregation
Returns: DataFrame corresponding to formatted docs
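The sketch below shows one way nested documents could be flattened into single-level rows before they are placed in a dataframe. It assumes that "unwinding" an array means giving each element its own indexed key (for example "mdf.elements.0"); that reading is an assumption, and flatten_doc is a hypothetical function, not the helper matminer uses.

```python
def flatten_doc(doc, unwind_arrays=True, prefix=""):
    """Flatten a nested dict into a single-level dict with dotted keys."""
    flat = {}
    for key, value in doc.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            flat.update(flatten_doc(value, unwind_arrays, prefix=name + "."))
        elif unwind_arrays and isinstance(value, (list, tuple)):
            # Assumed meaning of "unwinding": one indexed key per element.
            for i, item in enumerate(value):
                if isinstance(item, dict):
                    flat.update(
                        flatten_doc(item, unwind_arrays, prefix=f"{name}.{i}.")
                    )
                else:
                    flat[f"{name}.{i}"] = item
        else:
            # Arrays are kept whole when unwind_arrays is False.
            flat[name] = value
    return flat


doc = {"mdf": {"elements": ["Ag", "Be"]}, "oqmd": {"band_gap": {"value": 4.2}}}
unwound = flatten_doc(doc)
kept = flatten_doc(doc, unwind_arrays=False)
```

A list of such flattened rows could then be handed to the pandas.DataFrame constructor, with one column per dotted key.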
matminer.data_retrieval.retrieve_MP module¶
-
class
matminer.data_retrieval.retrieve_MP.
MPDataRetrieval
(api_key=None)¶ Bases:
matminer.data_retrieval.retrieve_base.BaseDataRetrieval
Retrieves data from the Materials Project database.
If you use this data retrieval class, please additionally cite:
Ong, S.P., Cholia, S., Jain, A., Brafman, M., Gunter, D., Ceder, G., Persson, K.A., 2015. The Materials Application Programming Interface (API): A simple, flexible and efficient API for materials data based on REpresentational State Transfer (REST) principles. Computational Materials Science 97, 209–215. https://doi.org/10.1016/j.commatsci.2014.10.037
-
__init__
(api_key=None)¶ - Args:
- api_key: (str) Your Materials Project API key, or None if you’ve
set up your pymatgen config.
-
api_link
()¶ The link to comprehensive API documentation or data source.
- Returns:
(str): A link to the API documentation for this DataRetrieval class.
-
get_data
(criteria, properties, mp_decode=False, index_mpid=True)¶ - Args:
- criteria: (str/dict) see MPRester.query() for a description of this
parameter. String examples: "mp-1234", "Fe2O3", "Li-Fe-O", "*2O3". Dict example: {"band_gap": {"$gt": 1}}
- properties: (list) see MPRester.query() for a description of this
parameter. Example: [“formula”, “formation_energy_per_atom”]
- mp_decode: (bool) see MPRester.query() for a description of this
parameter. Whether to decode to a Pymatgen object where possible.
- index_mpid: (bool) Whether to set the materials_id as the dataframe
index.
- Returns ([dict]):
a list of jsons that match the criteria and contain properties
-
get_dataframe
(criteria, properties, index_mpid=True, **kwargs)¶ Gets data from MP in a dataframe format. See api_link for more details.
- Args:
criteria (dict): the same as in get_data
properties ([str]): the same properties supported as in get_data, plus: "structure", "initial_structure", "final_structure", "bandstructure" (line mode), "bandstructure_uniform", "phonon_bandstructure", "phonon_ddb", "phonon_dos". Note that for a long list of compounds, it may take a long time to retrieve some of these objects.
index_mpid (bool): the same as in get_data
kwargs (dict): the same keyword arguments as in get_data
Returns (pandas.DataFrame):
-
try_get_prop_by_material_id
(prop, material_id_list, **kwargs)¶ Calls the relevant get_*_by_material_id method. “prop” is a property, such as bandstructure, that is not among the properties supported by get_data but is available through a dedicated method, for example get_bandstructure_by_material_id.
- Args:
- prop (str): the name of the property. Options are: "bandstructure", "dos", "phonon_dos", "phonon_bandstructure", "phonon_ddb"
material_id_list ([str]): list of material_ids of the compounds
kwargs (dict): other keyword arguments that get_*_by_material_id may accept, e.g. line_mode in get_bandstructure_by_material_id
- Returns ([target prop object or NaN]):
If the target property is not available for a certain material_id, NaN is returned.
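The "return NaN when unavailable" behaviour described above can be sketched as a loop that falls back to NaN whenever a lookup fails. Everything in this example is hypothetical: try_get_prop, fake_get_bandstructure, and FAKE_DB are stand-ins so the block runs without an API key, and treating any raised exception as "not available" is an assumption rather than documented matminer behaviour.

```python
def try_get_prop(getter, material_id_list, **kwargs):
    """Call getter for each id; fall back to NaN when the lookup fails."""
    results = []
    for material_id in material_id_list:
        try:
            results.append(getter(material_id, **kwargs))
        except Exception:
            # Assumption: a failed lookup means the property is unavailable.
            results.append(float("nan"))
    return results


# Hypothetical stand-in for a get_*_by_material_id method.
FAKE_DB = {"mp-1234": "bandstructure-for-mp-1234"}


def fake_get_bandstructure(material_id, line_mode=True):
    return FAKE_DB[material_id]


props = try_get_prop(fake_get_bandstructure, ["mp-1234", "mp-0"], line_mode=True)
```

Returning a list of the same length as material_id_list keeps the results aligned with the rows of a dataframe indexed by material id.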
matminer.data_retrieval.retrieve_MPDS module¶
matminer.data_retrieval.retrieve_MongoDB module¶
-
class
matminer.data_retrieval.retrieve_MongoDB.
MongoDataRetrieval
(coll)¶ Bases:
matminer.data_retrieval.retrieve_base.BaseDataRetrieval
-
__init__
(coll)¶ Retrieves data from a MongoDB collection to a pandas.DataFrame object
- Args:
coll: A MongoDB collection object
-
api_link
()¶ The link to comprehensive API documentation or data source.
- Returns:
(str): A link to the API documentation for this DataRetrieval class.
-
get_dataframe
(criteria, properties=None, limit=0, sort=None, idx_field=None, strict=False)¶ - Args:
criteria: (dict) - a pymongo-style query to filter data records
properties: ([str] or None) - a list of str fields to retrieve; dot-notation is allowed (e.g. "structure.lattice.a"). Set to None to try to auto-detect the fields.
limit: (int) - max number of entries. 0 means no limit
sort: (tuple) - pymongo-style sort option
idx_field: (str) - name of field to use as index (must have unique entries)
strict: (bool) - if False, replaces missing values with NaN
Returns (pandas.DataFrame):
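To make the dot-notation and strict options more concrete, the sketch below resolves dotted field names against a nested document and fills missing fields with NaN when strict is False. The functions get_dotted and extract_row are hypothetical and written only for this example; raising the original error when strict is True is an assumption, since the Args above only describe the strict=False case.

```python
def get_dotted(doc, field):
    """Follow a dot-notation path through nested dicts and lists."""
    value = doc
    for part in field.split("."):
        if isinstance(value, list):
            value = value[int(part)]
        else:
            value = value[part]
    return value


def extract_row(doc, properties, strict=False):
    """Build one row mapping each requested field to its value."""
    row = {}
    for field in properties:
        try:
            row[field] = get_dotted(doc, field)
        except (KeyError, IndexError, ValueError, TypeError):
            if strict:
                raise  # assumed behaviour for strict=True
            row[field] = float("nan")
    return row


doc = {"structure": {"lattice": {"a": 3.9}}}
row = extract_row(doc, ["structure.lattice.a", "band_gap"])
```

Here "band_gap" is absent from the document, so its entry in row is NaN while the nested lattice parameter is resolved normally.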
-
matminer.data_retrieval.retrieve_MongoDB.
clean_projection
(projection)¶ Projecting on e.g. 'a.b' and 'a' together is disallowed in MongoDB, so project inclusively. See the unit tests for examples of what this does.
- Args:
projection: (list) - list of fields to retrieve; dot-notation is allowed.
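One possible reading of "project inclusively" is that a field is dropped whenever one of its parent paths is also requested, since the parent already contains it. The sketch below implements that reading; inclusive_projection is a hypothetical function, and the real clean_projection may differ in details such as ordering or the handling of integer path segments, so the unit tests remain the authoritative reference.

```python
def inclusive_projection(projection):
    """Keep only fields that have no requested parent path."""
    fields = set(projection)
    kept = []
    for field in sorted(fields):
        parts = field.split(".")
        # All proper parent paths, e.g. "a.b.c" -> {"a", "a.b"}.
        ancestors = {".".join(parts[:i]) for i in range(1, len(parts))}
        if not ancestors & fields:
            kept.append(field)
    return kept


cleaned = inclusive_projection(["a.b", "a", "c.d"])
```

In this example "a.b" is removed because "a" is also requested, while "c.d" is kept because "c" is not.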
-
matminer.data_retrieval.retrieve_MongoDB.
is_int
(x)¶
-
matminer.data_retrieval.retrieve_MongoDB.
remove_ints
(projection)¶ Transforms a string like “a.1.x” to “a.x” - for Mongo projection purposes
- Args:
projection: (str) the projection to remove ints from
Returns (str)
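The transformation described above can be sketched in a few lines. The functions below are illustrative reimplementations with hypothetical names (is_int_sketch, remove_ints_sketch), not the matminer source; they assume that an "int" segment is any dot-separated part that Python's int() can parse.

```python
def is_int_sketch(x):
    """Return True if the string x can be parsed as an integer."""
    try:
        int(x)
        return True
    except ValueError:
        return False


def remove_ints_sketch(projection):
    """Drop integer segments from a dot-notation path."""
    parts = projection.split(".")
    return ".".join(p for p in parts if not is_int_sketch(p))


result = remove_ints_sketch("a.1.x")
```

Removing the array index turns a path into an element of a list into a path Mongo can use as a projection key.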
matminer.data_retrieval.retrieve_base module¶
-
class
matminer.data_retrieval.retrieve_base.
BaseDataRetrieval
¶ Bases:
object
Abstract class to retrieve data from various material APIs while adhering to a quasi-standard format for querying.
## Implementing a new DataRetrieval class
If you have an API which you’d like to incorporate into matminer’s data retrieval tools, using BaseDataRetrieval is the preferred way of doing so. All DataRetrieval classes should subclass BaseDataRetrieval and implement the following:
get_dataframe()
api_link()
Retrieving data should be done by the user with get_dataframe. Criteria should be a dictionary which will be used to form a query to the database. Properties should be a list which defines the columns that will be returned. While the ‘criteria’ and ‘properties’ arguments may have different valid values depending on the database, they should always have sensible formats and names if possible. For example, the user should be calling this:
df = MyDataRetrieval().get_dataframe(criteria={'band_gap': 0.0},
                                     properties=['structure'])
…or this:
df = MyDataRetrieval().get_dataframe(criteria={'band_gap': [0.0, 0.15]},
                                     properties=['density of states'])
NOT this:
df = MyDataRetrieval().get_dataframe(criteria={'query.bg[0] && band_gap': 0.0},
                                     properties=['Struct.page[Value]'])
The implemented DataRetrieval class should handle the conversion from a ‘sensible’ query to a query fit for the individual API and database.
There may be cases where a ‘sensible’ query is not sufficient to define a query to the API; in this case, use the get_dataframe kwargs sparingly to augment the criteria, properties, or form of the underlying API query.
A method for accessing raw DB data with an API-native query may be provided by overriding get_data. The link to the original API documentation must be provided by overriding api_link().
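The conversion from a 'sensible' query to an API-native one might look like the sketch below, which targets a hypothetical backend that accepts MongoDB-style operators. The function to_mongo_query is an assumption made for illustration, as is the choice to treat a two-element list as an inclusive [low, high] range; a real DataRetrieval class would adapt this step to its own API.

```python
def to_mongo_query(criteria):
    """Translate scalars to equality tests and [low, high] pairs to ranges."""
    query = {}
    for name, value in criteria.items():
        if isinstance(value, (list, tuple)) and len(value) == 2:
            low, high = value
            # Assumed convention: both bounds are inclusive.
            query[name] = {"$gte": low, "$lte": high}
        else:
            query[name] = value
    return query


exact = to_mongo_query({"band_gap": 0.0})
ranged = to_mongo_query({"band_gap": [0.0, 0.15]})
```

Keeping this translation inside the class lets users write the same criteria dictionary regardless of which database sits behind get_dataframe.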
## Documenting a DataRetrieval class
The class documentation for each DataRetrieval class must contain a brief description of the possible data that can be retrieved with the API source. It should also detail the form of the criteria and properties that can be retrieved with the class, and/or should link to a web page showing this information. The options of the class must all be defined in the __init__ function of the class, and we recommend documenting them using the [Google style](https://google.github.io/styleguide/pyguide.html).
-
api_link
()¶ The link to comprehensive API documentation or data source.
- Returns:
(str): A link to the API documentation for this DataRetrieval class.
-
get_dataframe
(criteria, properties, **kwargs)¶ Retrieve a dataframe of properties from the database which satisfy criteria.
- Args:
- criteria (dict): The name of each criterion is the key; the value
or range of the criterion is the value.
- properties (list): Properties to return from the query matching
the criteria. For example, [‘structure’, ‘formula’]
- Returns:
- (pandas DataFrame) The dataframe containing properties as columns
and samples as rows.