matminer.data_retrieval package
Submodules
matminer.data_retrieval.retrieve_Citrine module
class matminer.data_retrieval.retrieve_Citrine.CitrineDataRetrieval(api_key=None)
Bases: matminer.data_retrieval.retrieve_base.BaseDataRetrieval
CitrineDataRetrieval retrieves data from the Citrination database. See the API client docs at api_link below.
__init__(api_key=None)
- Args:
- api_key: (str) Your Citrine API key, or None if you’ve set the CITRINE_KEY environment variable.
api_link()
The link to comprehensive API documentation or data source.
- Returns:
- (str): A link to the API documentation for this DataRetrieval class.
get_data(formula=None, prop=None, data_type=None, reference=None, min_measurement=None, max_measurement=None, from_record=None, data_set_id=None, max_results=None)
Gets raw API data from Citrine in JSON format. See api_link for more information on the input parameters.
- Args:
- formula: (str) filter for the chemical formula field; only results whose chemical formulas contain this string are returned
- prop: (str) name of the property to search for
- data_type: (str) 'EXPERIMENTAL'/'COMPUTATIONAL'/'MACHINE_LEARNING'; filter for properties obtained from experimental work, computational methods, or machine learning
- reference: (str) filter for the reference field; only results whose contributors contain this string are returned
- min_measurement: (str/num) minimum of the property value range
- max_measurement: (str/num) maximum of the property value range
- from_record: (int) index of the first record to return (indexed from 0)
- data_set_id: (int) id of the particular data set to search on
- max_results: (int) maximum number of records to return
- Returns:
- (list) of JSONs/PIFs returned by Citrine’s API
get_dataframe(criteria, properties=None, common_fields=None, secondary_fields=False, print_properties_options=True)
Gets a Pandas dataframe object from data retrieved from the Citrine API.
- Args:
- criteria (dict): see the get_data method for supported keys, except prop; prop should be included in properties.
- properties ([str]): requested properties/fields/columns. For example, ["Seebeck coefficient", "Band gap"]. If unsure about the exact words, capitalization, etc., try something like ["gap"] with "max_results": 3 and print_properties_options=True to see the exact options for this field.
- common_fields ([str]): fields that are common to all the requested properties. A common example is "chemicalFormula". Look for suggested common fields after a quick query for more info.
- secondary_fields (bool): if True, fields not included in properties may be added to the output (e.g. references). Recommended only if len(properties)==1.
- print_properties_options (bool): whether to print available options for the "properties" and "common_fields" arguments.
- Returns:
- (object) Pandas dataframe object containing the results
matminer.data_retrieval.retrieve_Citrine.get_value(dict_item)
matminer.data_retrieval.retrieve_Citrine.parse_scalars(scalars)
matminer.data_retrieval.retrieve_MDF module
class matminer.data_retrieval.retrieve_MDF.MDFDataRetrieval(anonymous=False, **kwargs)
Bases: matminer.data_retrieval.retrieve_base.BaseDataRetrieval
MDFDataRetrieval retrieves data from the Materials Data Facility database and converts it into a Pandas DataFrame. Full access to MDF requires authentication (see api_link), but an anonymous mode is also supported by passing anonymous=True as a keyword argument.
- Examples:
>>> mdf_dr = MDFDataRetrieval(anonymous=True)
>>> results = mdf_dr.get_dataframe({"elements": ["Ag", "Be"], "source_names": ["oqmd"]})
>>> results = mdf_dr.get_dataframe({"source_names": ["oqmd"],
...                                 "match_ranges": {"oqmd.band_gap.value": [4.0, "*"]}})
__init__(anonymous=False, **kwargs)
- Args:
- anonymous (bool): whether to use anonymous login (i.e. no Globus authentication)
- **kwargs: kwargs for Forge, including index (the Globus search index to search on), local_ep, anonymous
api_link()
The link to comprehensive API documentation or data source.
- Returns:
- (str): A link to the API documentation for this DataRetrieval class.
get_data(squery, unwind_arrays=True, **kwargs)
Gets a dataframe from the MDF API from an explicit string query (rather than input args like get_dataframe).
- Args:
- squery (str): string for the explicit query
- unwind_arrays (bool): whether or not to unwind arrays when flattening docs for the dataframe
- **kwargs: kwargs for the query
- Returns:
- dataframe corresponding to the query
get_dataframe(criteria, properties=None, unwind_arrays=True)
Retrieves data from the MDF API and formats it as a Pandas DataFrame.
- Args:
- criteria (dict): options for keys are
  - source_names ([str]): source names to include, e.g. ["oqmd"]
  - elements ([str]): elements to include, e.g. ["Ag", "Si"]
  - titles ([str]): titles to include, e.g. ["Coarsening of a semisolid Al-Cu alloy"]
  - tags ([str]): tags to include, e.g. ["outcar"]
  - resource_types ([str]): resources to include, e.g. ["record"]
  - match_fields ({}): field-value mappings to include, e.g. {"oqmd.converged": True}
  - exclude_fields ({}): field-value mappings to exclude, e.g. {"oqmd.converged": False}
  - match_ranges ({}): field-range mappings to include, e.g. {"oqmd.band_gap.value": [1, 5]}; use "*" for no lower or upper bound, e.g. {"oqmd.band_gap.value": [1, "*"]}
  - exclude_ranges ({}): field-range mappings to exclude, e.g. {"oqmd.band_gap.value": [3, "*"]} to exclude all results with a band gap higher than 3
  - raw (bool): whether or not to return raw (non-dataframe) output, defaults to False
- unwind_arrays (bool): whether or not to unwind arrays when flattening docs for the dataframe
- Returns (pandas.DataFrame):
- DataFrame corresponding to all documents from the aggregated query
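The "*" wildcard in match_ranges and exclude_ranges can be read as an open bound. The following is a minimal sketch of that reading in plain Python. The helper name in_range is hypothetical (it is not part of matminer or the MDF client), and treating both bounds as inclusive is an assumption of the sketch.

```python
def in_range(value, bounds):
    """Return True if value lies within [low, high]; "*" means unbounded.

    Hypothetical helper for illustration only; inclusive bounds are assumed.
    """
    low, high = bounds
    above_low = low == "*" or value >= low
    below_high = high == "*" or value <= high
    return above_low and below_high


# Mirrors {"oqmd.band_gap.value": [1, "*"]}: any band gap of at least 1.
print(in_range(4.2, [1, "*"]))  # True
print(in_range(0.5, [1, "*"]))  # False
print(in_range(3.0, [1, 5]))    # True
```

Under this reading, exclude_ranges drops exactly the records for which the same check would return True.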
matminer.data_retrieval.retrieve_MDF.make_dataframe(docs, unwind_arrays=True)
Formats raw docs returned from an MDF API search into a dataframe.
- Args:
- docs ([{}]): list of documents from a Forge search or aggregation
- Returns:
- DataFrame corresponding to the formatted docs
matminer.data_retrieval.retrieve_MP module
class matminer.data_retrieval.retrieve_MP.MPDataRetrieval(api_key=None)
Bases: matminer.data_retrieval.retrieve_base.BaseDataRetrieval
Retrieves data from the Materials Project database.
__init__(api_key=None)
- Args:
- api_key: (str) Your Materials Project API key, or None if you’ve set up your pymatgen config.
api_link()
The link to comprehensive API documentation or data source.
- Returns:
- (str): A link to the API documentation for this DataRetrieval class.
get_data(criteria, properties, mp_decode=False, index_mpid=True)
- Args:
- criteria: (str/dict) see MPRester.query() for a description of this parameter. String examples: "mp-1234", "Fe2O3", "Li-Fe-O", "*2O3". Dict example: {"band_gap": {"$gt": 1}}
- properties: (list) see MPRester.query() for a description of this parameter. Example: ["formula", "formation_energy_per_atom"]
- mp_decode: (bool) see MPRester.query() for a description of this parameter. Whether to decode to a pymatgen object where possible.
- index_mpid: (bool) Whether to set the materials_id as the dataframe index.
- Returns ([dict]):
- a list of JSONs that match the criteria and contain properties
get_dataframe(criteria, properties, index_mpid=True, **kwargs)
Gets data from MP in a dataframe format. See api_link for more details.
- Args:
- All arguments, including criteria, properties and index_mpid, are the same as in get_data.
- Returns (pandas.DataFrame):
matminer.data_retrieval.retrieve_MPDS module
matminer.data_retrieval.retrieve_MongoDB module
class matminer.data_retrieval.retrieve_MongoDB.MongoDataRetrieval(coll)
Bases: matminer.data_retrieval.retrieve_base.BaseDataRetrieval
__init__(coll)
Retrieves data from a MongoDB collection to a pandas.DataFrame object.
- Args:
- coll: A MongoDB collection object
api_link()
The link to comprehensive API documentation or data source.
- Returns:
- (str): A link to the API documentation for this DataRetrieval class.
get_dataframe(criteria, properties=None, limit=0, sort=None, idx_field=None, strict=False)
- Args:
- criteria: (dict) a pymongo-style query to filter data records
- properties: ([str] or None) a list of str fields to retrieve; dot-notation is allowed (e.g. "structure.lattice.a"). Set to None to try to auto-detect the fields.
- limit: (int) max number of entries. 0 means no limit
- sort: (tuple) pymongo-style sort option
- idx_field: (str) name of the field to use as index (must have unique entries)
- strict: (bool) if False, replaces missing values with NaN
- Returns (pandas.DataFrame):
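To illustrate how a dot-notation property such as "structure.lattice.a" maps onto a nested record, and what replacing missing values with NaN looks like, here is a minimal plain-Python sketch. The helper get_nested is hypothetical and is not matminer's code; raising KeyError when strict=True is an assumption, since the documentation above only describes the strict=False behaviour.

```python
import math


def get_nested(doc, dotted_key, strict=False):
    """Resolve a dot-notation key such as "structure.lattice.a" in a nested dict.

    Hypothetical helper: with strict=False a missing field yields NaN;
    raising KeyError when strict=True is an assumption of this sketch.
    """
    current = doc
    for part in dotted_key.split("."):
        if isinstance(current, dict) and part in current:
            current = current[part]
        elif strict:
            raise KeyError(dotted_key)
        else:
            return float("nan")
    return current


record = {"structure": {"lattice": {"a": 3.2}}}
print(get_nested(record, "structure.lattice.a"))              # 3.2
print(math.isnan(get_nested(record, "structure.lattice.b")))  # True
```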
matminer.data_retrieval.retrieve_MongoDB.clean_projection(projection)
Projecting on e.g. 'a.b' and 'a' at the same time is disallowed in MongoDB, so project inclusively. See the unit tests for examples of what this does.
- Args:
- projection: (list) list of fields to retrieve; dot-notation is allowed.
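One way to read "project inclusively" is to keep only the outermost requested field whenever a parent and one of its children are both present. The sketch below shows that idea in plain Python. The function inclusive_projection is a hypothetical stand-in written for this illustration; it is not matminer's clean_projection and may differ from it in detail.

```python
def inclusive_projection(fields):
    """Drop any field whose dotted parent is also requested.

    Hypothetical stand-in for illustration, not matminer's clean_projection.
    """
    unique = sorted(set(fields))
    requested = set(unique)
    kept = []
    for field in unique:
        parts = field.split(".")
        # All proper dotted prefixes, e.g. "a.b.c" -> {"a", "a.b"}
        parents = {".".join(parts[:i]) for i in range(1, len(parts))}
        if not parents & requested:
            kept.append(field)
    return kept


print(inclusive_projection(["a.b", "a", "c.d", "ab.x"]))
# ['a', 'ab.x', 'c.d']
```

Comparing whole dotted components, rather than raw string prefixes, keeps "ab.x" from being mistaken for a child of "a".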
matminer.data_retrieval.retrieve_MongoDB.is_int(x)
matminer.data_retrieval.retrieve_MongoDB.remove_ints(projection)
Transforms a string like "a.1.x" to "a.x" for Mongo projection purposes.
- Args:
- projection: (str) the projection to remove ints from
- Returns (str)
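The transformation described above can be sketched in a couple of lines. The function strip_int_parts below is a hypothetical re-implementation for illustration only; matminer's remove_ints may handle edge cases differently.

```python
def strip_int_parts(projection):
    """Remove purely numeric components from a dotted projection string.

    Hypothetical re-implementation for illustration; matminer's remove_ints
    may differ in edge cases.
    """
    parts = projection.split(".")
    return ".".join(p for p in parts if not p.isdigit())


print(strip_int_parts("a.1.x"))    # a.x
print(strip_int_parts("a.b10.x"))  # a.b10.x
```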
matminer.data_retrieval.retrieve_base module
class matminer.data_retrieval.retrieve_base.BaseDataRetrieval
Bases: object
Abstract class to retrieve data from various material APIs while adhering to a quasi-standard format for querying.
## Implementing a new DataRetrieval class
If you have an API which you’d like to incorporate into matminer’s data retrieval tools, using BaseDataRetrieval is the preferred way of doing so. All DataRetrieval classes should subclass BaseDataRetrieval and implement the following:
- get_dataframe()
- api_link()
Retrieving data should be done by the user with get_dataframe. Criteria should be a dictionary which will be used to form a query to the database. Properties should be a list which defines the columns that will be returned. While the ‘criteria’ and ‘properties’ arguments may have different valid values depending on the database, they should always have sensible formats and names if possible. For example, the user should be calling this:
    df = MyDataRetrieval().get_dataframe(criteria={'band_gap': 0.0},
                                         properties=['structure'])
…or this:
    df = MyDataRetrieval().get_dataframe(criteria={'band_gap': [0.0, 0.15]},
                                         properties=["density of states"])
NOT this:
    df = MyDataRetrieval().get_dataframe(criteria={'query.bg[0] && band_gap': 0.0},
                                         properties=['Struct.page[Value]'])
The implemented DataRetrieval class should handle the conversion from a ‘sensible’ query to a query fit for the individual API and database.
There may be cases where a ‘sensible’ query is not sufficient to define a query to the API; in this case, use the get_dataframe kwargs sparingly to augment the criteria, properties, or form of the underlying API query.
A method for accessing raw DB data with an API-native query may be provided by overriding get_data. The link to the original API documentation must be provided by overriding api_link().
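As a rough illustration of the conversion from a 'sensible' query to an API-native one, the sketch below translates criteria of the form shown above into a MongoDB-style filter. The function to_mongo_query is hypothetical, and its conventions (a scalar means equality, a two-item list means an inclusive range) are assumptions made for this example rather than rules defined by matminer.

```python
def to_mongo_query(criteria):
    """Translate 'sensible' criteria into a MongoDB-style filter.

    Hypothetical helper: a scalar means equality, a two-item list or tuple
    means an inclusive [min, max] range. These conventions are assumptions.
    """
    query = {}
    for name, value in criteria.items():
        if isinstance(value, (list, tuple)) and len(value) == 2:
            low, high = value
            query[name] = {"$gte": low, "$lte": high}
        else:
            query[name] = value
    return query


print(to_mongo_query({"band_gap": [0.0, 0.15]}))
# {'band_gap': {'$gte': 0.0, '$lte': 0.15}}
print(to_mongo_query({"band_gap": 0.0}))
# {'band_gap': 0.0}
```

A subclass's get_dataframe could call a helper like this before handing the query to the underlying database client, so that users never have to write the native syntax themselves.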
## Documenting a DataRetrieval class
The class documentation for each DataRetrieval class must contain a brief description of the possible data that can be retrieved with the API source. It should also detail the form of the criteria and properties that can be retrieved with the class, and/or should link to a web page showing this information. The options of the class must all be defined in the __init__ function of the class, and we recommend documenting them using the [Google style](https://google.github.io/styleguide/pyguide.html).
api_link()
The link to comprehensive API documentation or data source.
- Returns:
- (str): A link to the API documentation for this DataRetrieval class.
get_dataframe(criteria, properties, **kwargs)
Retrieve a dataframe of properties from the database which satisfy criteria.
- Args:
- criteria (dict): The name of each criterion is the key; the value or range of the criterion is the value.
- properties (list): Properties to return from the query matching the criteria. For example, ['structure', 'formula']
- Returns:
- (pandas DataFrame) The dataframe containing properties as columns and samples as rows.