General Functions
- pyscisci.utils.download_file_from_google_drive(file_id, destination=None, CHUNK_SIZE=32768)
Download a data file from Google Drive.
Modified from: https://stackoverflow.com/questions/38511444/python-download-files-from-google-drive-using-url
- pyscisci.utils.empty_mode(a)
Mode of an array, with handling for the case where the array is empty.
- Parameters:
a (numpy array or list) – array of values
- pyscisci.utils.groupby_count(df, colgroupby, colcountby, count_unique=True, show_progress=False)
Group the DataFrame and count the number of items in each group.
- Parameters:
df (DataFrame) – The DataFrame.
colgroupby (str) – The column to groupby.
colcountby (str) – The column to count.
count_unique (bool, default True) – If True, count unique items in the rows. If False, just return the number of rows.
show_progress (bool or str, default False) – If True, display a progress bar for the count. If str, the name of the progress bar to display.
- Returns:
DataFrame with two columns: colgroupby, colcountby+'Count'
- Return type:
DataFrame
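As a rough illustration of the two counting modes, the sketch below reproduces the described behaviour with plain pandas instead of calling pyscisci. The helper name `count_per_group` and the sample columns are hypothetical; the output column name follows the colcountby+'Count' convention stated above.

```python
import pandas as pd

# Hypothetical sample data: author 1 has two rows for publication 10.
df = pd.DataFrame({
    "AuthorId": [1, 1, 1, 2, 2],
    "PublicationId": [10, 10, 11, 12, 13],
})

def count_per_group(df, colgroupby, colcountby, count_unique=True):
    """Illustrative stand-in, not the pyscisci implementation."""
    grouped = df.groupby(colgroupby, sort=True)[colcountby]
    # count_unique=True counts distinct items; False counts rows.
    counts = grouped.nunique() if count_unique else grouped.size()
    return counts.rename(colcountby + "Count").reset_index()

unique_counts = count_per_group(df, "AuthorId", "PublicationId", count_unique=True)
row_counts = count_per_group(df, "AuthorId", "PublicationId", count_unique=False)
```

With this data the unique count for author 1 is 2, while the row count is 3, which is the distinction the count_unique flag controls.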
- pyscisci.utils.groupby_mean(df, colgroupby, colcountby, show_progress=False)
Group the DataFrame and find the mean of the column.
- Parameters:
df (DataFrame) – The DataFrame.
colgroupby (str) – The column to groupby.
colcountby (str) – The column to find the mean of values.
show_progress (bool or str, default False) – If True, display a progress bar for the mean. If str, the name of the progress bar to display.
- Returns:
DataFrame with two columns: colgroupby, colcountby+'Mean'
- Return type:
DataFrame
- pyscisci.utils.groupby_range(df, colgroupby, colrange, show_progress=False)
Group the DataFrame and find the range between the smallest and largest value for each group.
- Parameters:
df (DataFrame) – The DataFrame.
colgroupby (str) – The column to groupby.
colrange (str) – The column to find the range of values.
show_progress (bool or str, default False) – If True, display a progress bar for the range. If str, the name of the progress bar to display.
- Returns:
DataFrame with two columns: colgroupby, colrange+'Range'
- Return type:
DataFrame
- pyscisci.utils.groupby_total(df, colgroupby, colcountby, show_progress=False)
Group the DataFrame and find the total of the column.
- Parameters:
df (DataFrame) – The DataFrame.
colgroupby (str) – The column to groupby.
colcountby (str) – The column to find the total of values.
show_progress (bool or str, default False) – If True, display a progress bar for the summation. If str, the name of the progress bar to display.
- Returns:
DataFrame with two columns: colgroupby, colcountby+'Total'
- Return type:
DataFrame
- pyscisci.utils.groupby_zero_col(df, colgroupby, colrange, show_progress=False)
Group the DataFrame and shift the column so the minimum value is 0.
- Parameters:
df (DataFrame) – The DataFrame.
colgroupby (str) – The column to groupby.
colrange (str) – The column whose values are shifted.
show_progress (bool or str, default False) – If True, display a progress bar. If str, the name of the progress bar to display.
- Returns:
DataFrame with two columns: colgroupby, colrange
- Return type:
DataFrame
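The per-group shift described here can be sketched in plain pandas as follows. This is an illustration under assumed sample data (the `AuthorId` and `Year` columns are hypothetical), not the pyscisci implementation. It also shows how the shift relates to the per-group range: after shifting, the largest value in each group equals max minus min.

```python
import pandas as pd

# Hypothetical sample data: publication years for two authors.
df = pd.DataFrame({
    "AuthorId": [1, 1, 1, 2, 2],
    "Year": [2001, 2003, 2010, 1995, 1999],
})

# Broadcast each group's minimum back onto its rows, then subtract it.
group_min = df.groupby("AuthorId")["Year"].transform("min")
shifted = df[["AuthorId"]].copy()
shifted["Year"] = df["Year"] - group_min

# The maximum of the shifted column per group is that group's range.
ranges = shifted.groupby("AuthorId")["Year"].max()
```

Here author 1's years become 0, 2, 9 and author 2's become 0, 4, so the ranges are 9 and 4.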
- pyscisci.utils.holder_mean(a, rho=1)
Hölder mean
- Parameters:
a (numpy array or list) – array of values
rho (float) – Hölder parameter: rho=1 gives the arithmetic mean; rho=0 the geometric mean; rho=-1 the harmonic mean; rho=2 the quadratic mean; rho → ∞ the max; rho → -∞ the min.
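For reference, the Hölder (power) mean of n positive values is ((1/n) * sum(a_i ** rho)) ** (1/rho), with the geometric mean as the limit rho → 0. The sketch below is a minimal standard-library version of that definition; the name `power_mean` is hypothetical, and how pyscisci itself treats the limiting cases is an assumption here.

```python
import math

def power_mean(values, rho=1.0):
    """Hölder mean of positive values (illustrative sketch only)."""
    values = [float(v) for v in values]
    n = len(values)
    if rho == 0:
        # Limit rho -> 0: geometric mean, computed via logs.
        return math.exp(sum(math.log(v) for v in values) / n)
    if math.isinf(rho):
        # Limits rho -> +inf and rho -> -inf.
        return max(values) if rho > 0 else min(values)
    return (sum(v ** rho for v in values) / n) ** (1.0 / rho)

data = [1, 2, 4]
arithmetic = power_mean(data, 1)    # (1 + 2 + 4) / 3
geometric = power_mean(data, 0)     # cube root of 8
harmonic = power_mean(data, -1)     # 3 / (1 + 1/2 + 1/4)
quadratic = power_mean(data, 2)     # sqrt(21 / 3)
```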
- pyscisci.utils.isin_range(values2check, min_value, max_value)
Check if the values2check are in the inclusive range [min_value, max_value].
- Parameters:
values2check (numpy array) – The values to check.
min_value (float) – The lowest value of the range.
max_value (float) – The highest value of the range.
- Returns:
Boolean array that is True where the value is in the range.
- Return type:
Numpy Array
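Because the range is inclusive, both endpoints count as inside. A minimal pure-Python sketch of that check follows; the helper name `in_inclusive_range` is hypothetical, and with NumPy arrays the same idea would be an elementwise comparison.

```python
def in_inclusive_range(values, min_value, max_value):
    """Illustrative check: True where min_value <= v <= max_value."""
    return [min_value <= v <= max_value for v in values]

# 5 and 10 sit exactly on the boundaries and are included.
flags = in_inclusive_range([1, 5, 10, 11], 5, 10)
```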
- pyscisci.utils.isin_sorted(values2check, masterlist)
Check if the values2check are in the sorted masterlist.
- Parameters:
values2check (numpy array) – The values to check.
masterlist (numpy array) – The sorted list of master values.
- Returns:
Boolean array that is True where the value is in the masterlist.
- Return type:
Numpy Array
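Requiring a sorted masterlist allows each lookup to use binary search rather than a linear scan. The sketch below shows that idea with the standard-library `bisect` module; the helper name `in_sorted` is hypothetical, and this is not necessarily how pyscisci implements the check.

```python
from bisect import bisect_left

def in_sorted(values, sorted_master):
    """Illustrative membership test against an already-sorted list."""
    result = []
    for v in values:
        # Leftmost position where v could be inserted while keeping order.
        i = bisect_left(sorted_master, v)
        result.append(i < len(sorted_master) and sorted_master[i] == v)
    return result

membership = in_sorted([3, 4, 10, 0], [1, 3, 5, 10])
```

If the master list is not actually sorted, binary search can miss values that are present, which is why the precondition matters.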
- pyscisci.utils.jenson_shannon(p, q)
Jensen–Shannon divergence
- pyscisci.utils.kl(p, q)
Kullback–Leibler divergence (KL-divergence)
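The two divergences are related: the Jensen–Shannon divergence is the average KL divergence of p and q from their midpoint distribution. The sketch below uses the textbook definitions with natural logarithms; the function names are hypothetical, and whether pyscisci normalizes its inputs or uses a different log base is not stated in this page.

```python
import math

def kl_divergence(p, q):
    """KL(p || q) in nats; assumes q_i > 0 wherever p_i > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def js_divergence(p, q):
    """Jensen-Shannon divergence via the midpoint distribution m."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)

p = [0.5, 0.5]
q = [0.25, 0.75]
kl_pq = kl_divergence(p, q)
js_pq = js_divergence(p, q)
# Two distributions with disjoint support reach the maximum, ln 2.
js_max = js_divergence([1.0, 0.0], [0.0, 1.0])
```

Unlike KL, the Jensen–Shannon divergence is symmetric in its arguments and stays finite when the supports differ.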
- pyscisci.utils.rank_array(a, ascending=True, normed=False)
Rank the elements of the array. On the normed scale, ascending ranks the lowest value 0 and the highest 1; descending ranks the lowest value 1 and the highest 0.
- Parameters:
a (numpy array or list) – Object to rank
ascending (bool, default True) – Sort ascending vs. descending.
normed (bool, default False) – If False, ranks run from 0 to N-1. If True, ranks are rescaled to run from 0 to 1.
- Return type:
Ranked array.
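To make the ascending and normed options concrete, here is a small pure-Python sketch of the ranking described above. The name `rank_values` is hypothetical, and breaking ties by position is an assumption; the page does not say how pyscisci handles tied values.

```python
def rank_values(a, ascending=True, normed=False):
    """Illustrative ranking: 0-based ranks, optionally rescaled to [0, 1]."""
    # Indices of a, ordered by value (reversed for descending ranks).
    order = sorted(range(len(a)), key=lambda i: a[i], reverse=not ascending)
    ranks = [0] * len(a)
    for rank, idx in enumerate(order):
        ranks[idx] = rank
    if normed and len(a) > 1:
        return [r / (len(a) - 1) for r in ranks]
    return ranks

asc = rank_values([30, 10, 20])
desc = rank_values([30, 10, 20], ascending=False)
asc_normed = rank_values([30, 10, 20], normed=True)
```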
- pyscisci.utils.shannon_entropy(p, base=2.718281828459045)
Shannon entropy
- Parameters:
p (numpy array or list) – array of values
base (float, default e) – base of the logarithm; the default corresponds to the natural logarithm.
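The default base in the signature is Euler's number, so the entropy would be measured in nats unless another base is passed. The sketch below uses the textbook definition, assuming p is already a probability distribution; the name `entropy` is hypothetical and this is not the pyscisci source.

```python
import math

def entropy(p, base=math.e):
    """Shannon entropy -sum(p_i * log(p_i)), converted to the given base."""
    # Terms with p_i == 0 contribute nothing and are skipped.
    return -sum(pi * math.log(pi) for pi in p if pi > 0) / math.log(base)

fair_coin_nats = entropy([0.5, 0.5])          # natural log
fair_coin_bits = entropy([0.5, 0.5], base=2)  # one bit
```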
- pyscisci.utils.simpson(p)
Simpson diversity
- Parameters:
p (numpy array or list) – array of values
- pyscisci.utils.simpson_finite(p)
Simpson diversity with finite size correction
- Parameters:
p (numpy array or list) – array of values
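Several quantities go by the name Simpson diversity, and this page does not say which one pyscisci computes. As one common reading, the sketch below shows the Gini–Simpson form 1 - sum(p_i ** 2) for proportions, and the usual finite-sample version computed from raw counts. Both function names are hypothetical, and the choice of variant is an assumption.

```python
def gini_simpson(p):
    """1 - sum(p_i^2): chance that two draws with replacement differ."""
    return 1.0 - sum(pi * pi for pi in p)

def gini_simpson_from_counts(counts):
    """Finite-sample form: draws without replacement from n observed items."""
    n = sum(counts)
    return 1.0 - sum(c * (c - 1) for c in counts) / (n * (n - 1))

even_split = gini_simpson([0.5, 0.5])
even_split_finite = gini_simpson_from_counts([2, 2])
```

The finite-sample value is larger here (2/3 versus 1/2) because drawing without replacement from only four items makes a repeat less likely.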
- pyscisci.utils.uniquemap_by_frequency(df, colgroupby='PublicationId', colcountby='FieldId', ascending=False)
Reduce a one-to-many mapping to a single selection based on frequency of occurrence in the DataFrame (either the largest, most common value or the smallest, least common value).
- Parameters:
ascending (bool, default False) – If False, larger counts dominate and the map defaults to the most common value. If True, smaller counts dominate and the map defaults to the least common value.
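One way to read this description: count how often each colcountby value occurs across the whole table, then keep, for every colgroupby key, the candidate with the highest (or lowest) overall count. The pure-Python sketch below follows that reading on (key, value) pairs. The helper name, the sample data, and the rule that ties keep the first candidate seen are all assumptions, not documented pyscisci behaviour.

```python
from collections import Counter

def reduce_by_frequency(pairs, ascending=False):
    """Illustrative reduction of a one-to-many mapping to one value per key."""
    # Overall frequency of each value across all pairs.
    freq = Counter(value for _, value in pairs)
    chosen = {}
    for key, value in pairs:
        if key not in chosen:
            chosen[key] = value
            continue
        current = chosen[key]
        if ascending:
            better = freq[value] < freq[current]
        else:
            better = freq[value] > freq[current]
        if better:
            chosen[key] = value
    return chosen

# Hypothetical (PublicationId, FieldId) pairs: field "B" is the most common.
pairs = [(1, "A"), (1, "B"), (2, "B"), (3, "B"), (3, "C")]
most_common = reduce_by_frequency(pairs, ascending=False)
least_common = reduce_by_frequency(pairs, ascending=True)
```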
- pyscisci.utils.value_to_int(a, sort_values='value', ascending=False, return_map=True)
Map the values of an array to integers.
- Parameters:
a (numpy array or list) – array of values
sort_values (str, default 'value') – How to order the items before mapping: 'none' does not sort; 'value' sorts the array items by their value; 'freq' sorts the array items by their frequency (see ascending).
ascending (bool, default False) – Only used when sort_values == 'freq'. If False, larger counts dominate and 0 maps to the most common item. If True, smaller counts dominate and 0 maps to the least common item.
return_map (bool, default True) –
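The three sort_values modes differ only in the order in which distinct items receive their integer codes. The sketch below illustrates that with the standard library. The name `to_int_codes` is hypothetical; treating 'none' as order of first appearance, keeping first-appearance order among equally frequent items, and always returning the mapping are assumptions made for the example.

```python
from collections import Counter

def to_int_codes(a, sort_values="value", ascending=False):
    """Illustrative mapping of items to integer codes 0..K-1."""
    if sort_values == "value":
        uniques = sorted(set(a))
    elif sort_values == "freq":
        counts = Counter(a)
        # Stable sort: equally frequent items keep first-appearance order.
        uniques = sorted(counts, key=lambda v: counts[v], reverse=not ascending)
    else:
        # Assumed meaning of 'none': order of first appearance.
        uniques = list(dict.fromkeys(a))
    mapping = {v: i for i, v in enumerate(uniques)}
    return [mapping[v] for v in a], mapping

items = ["b", "a", "b", "c", "b", "a"]
by_value, value_map = to_int_codes(items, sort_values="value")
by_freq, freq_map = to_int_codes(items, sort_values="freq", ascending=False)
by_rare, rare_map = to_int_codes(items, sort_values="freq", ascending=True)
```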
- pyscisci.utils.welford_mean_m2(previous_count, previous_mean, previous_m2, new_value)
Welford’s algorithm for online mean and variance.
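Welford's algorithm updates a running count, mean, and sum of squared deviations (M2) one value at a time, so the variance can be recovered without storing the data. The sketch below shows one standard form of the update step with the same four inputs as the signature above. The name `welford_update` and the returned (count, mean, m2) tuple are assumptions, since the page does not document the return value.

```python
def welford_update(count, mean, m2, new_value):
    """One step of Welford's online update (illustrative sketch)."""
    count += 1
    delta = new_value - mean          # deviation from the old mean
    mean += delta / count
    m2 += delta * (new_value - mean)  # uses the updated mean
    return count, mean, m2

count, mean, m2 = 0, 0.0, 0.0
for x in [2, 4, 4, 4, 5, 5, 7, 9]:
    count, mean, m2 = welford_update(count, mean, m2, x)

population_variance = m2 / count
sample_variance = m2 / (count - 1)
```

Dividing M2 by the count gives the population variance, and dividing by count minus one gives the sample variance.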