General Functions

pyscisci.utils.download_file_from_google_drive(file_id, destination=None, CHUNK_SIZE=32768)

Download a data file from Google Drive.

Modified from: https://stackoverflow.com/questions/38511444/python-download-files-from-google-drive-using-url

pyscisci.utils.empty_mode(a)

Mode of an array, with handling for the case where the array is empty.

Parameters:

a (numpy array or list) – array of values

pyscisci.utils.groupby_count(df, colgroupby, colcountby, count_unique=True, show_progress=False)

Group the DataFrame and count the entries of colcountby in each group.

Parameters:
  • df (DataFrame) – The DataFrame.

  • colgroupby (str) – The column to groupby.

  • colcountby (str) – The column to count.

  • count_unique (bool, default True) – If True, count unique items in the rows. If False, just return the number of rows.

  • show_progress (bool or str, default False) – If True, display a progress bar for the count. If str, the name of the progress bar to display.

Returns:

DataFrame with two columns: colgroupby, colcountby+`Count`

Return type:

DataFrame
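The two counting modes can be illustrated with plain pandas. This is a minimal sketch that does not call pyscisci; the column names and sample data are invented for the example, and the `+ "Count"` column name follows the Returns description above.

```python
import pandas as pd

# Hypothetical sample data: author 1 is listed on publication 10 twice.
df = pd.DataFrame({
    "AuthorId": [1, 1, 1, 2, 2],
    "PublicationId": [10, 10, 11, 10, 12],
})

colgroupby, colcountby = "AuthorId", "PublicationId"

# Analogue of count_unique=True: number of distinct values per group.
unique_counts = (
    df.groupby(colgroupby)[colcountby]
    .nunique()
    .rename(colcountby + "Count")
    .reset_index()
)

# Analogue of count_unique=False: number of rows per group.
row_counts = (
    df.groupby(colgroupby)[colcountby]
    .size()
    .rename(colcountby + "Count")
    .reset_index()
)
```

The two results differ only for groups that contain repeated values, here author 1.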

pyscisci.utils.groupby_mean(df, colgroupby, colcountby, show_progress=False)

Group the DataFrame and find the mean of the column.

Parameters:
  • df (DataFrame) – The DataFrame.

  • colgroupby (str) – The column to groupby.

  • colcountby (str) – The column whose values are averaged.

  • show_progress (bool or str, default False) – If True, display a progress bar for the mean. If str, the name of the progress bar to display.

Returns:

DataFrame with two columns: colgroupby, colcountby+’Mean’

Return type:

DataFrame

pyscisci.utils.groupby_range(df, colgroupby, colrange, show_progress=False)

Group the DataFrame and find the range between the smallest and largest value for each group.

Parameters:
  • df (DataFrame) – The DataFrame.

  • colgroupby (str) – The column to groupby.

  • colrange (str) – The column to find the range of values.

  • show_progress (bool or str, default False) – If True, display a progress bar for the range. If str, the name of the progress bar to display.

Returns:

DataFrame with two columns: colgroupby, colrange+`Range`

Return type:

DataFrame
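The per-group range (largest value minus smallest value) can be sketched with plain pandas as follows. The snippet does not call pyscisci; the `AuthorId` and `Year` columns and their values are invented for illustration.

```python
import pandas as pd

# Hypothetical sample data: publication years per author.
df = pd.DataFrame({
    "AuthorId": [1, 1, 1, 2, 2],
    "Year": [2001, 2005, 2010, 1999, 2003],
})

colgroupby, colrange = "AuthorId", "Year"

grouped = df.groupby(colgroupby)[colrange]

# Range = largest value minus smallest value within each group.
ranges = (
    (grouped.max() - grouped.min())
    .rename(colrange + "Range")
    .reset_index()
)
```

With this data, author 1 spans 9 years and author 2 spans 4 years.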

pyscisci.utils.groupby_total(df, colgroupby, colcountby, show_progress=False)

Group the DataFrame and find the total of the column.

Parameters:
  • df (DataFrame) – The DataFrame.

  • colgroupby (str) – The column to groupby.

  • colcountby (str) – The column whose values are summed.

  • show_progress (bool or str, default False) – If True, display a progress bar for the summation. If str, the name of the progress bar to display.

Returns:

DataFrame with two columns: colgroupby, colcountby+’Total’

Return type:

DataFrame

pyscisci.utils.groupby_zero_col(df, colgroupby, colrange, show_progress=False)

Group the DataFrame and shift the column so the minimum value is 0.

Parameters:
  • df (DataFrame) – The DataFrame.

  • colgroupby (str) – The column to groupby.

  • colrange (str) – The column to shift so that its minimum within each group is 0.

  • show_progress (bool or str, default False) – If True, display a progress bar. If str, the name of the progress bar to display.

Returns:

DataFrame with two columns: colgroupby, colrange

Return type:

DataFrame
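Shifting a column so that each group starts at 0 is useful, for example, for turning publication years into years since an author's first paper. A minimal pandas sketch of that idea, not the library's code, with invented column names and data:

```python
import pandas as pd

# Hypothetical sample data: publication years per author.
df = pd.DataFrame({
    "AuthorId": [1, 1, 1, 2, 2],
    "Year": [2001, 2005, 2010, 1999, 2003],
})

colgroupby, colrange = "AuthorId", "Year"

shifted = df[[colgroupby, colrange]].copy()

# Subtract each group's minimum so every group starts at 0.
group_min = shifted.groupby(colgroupby)[colrange].transform("min")
shifted[colrange] = shifted[colrange] - group_min
```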

pyscisci.utils.holder_mean(a, rho=1)

Hölder mean (also called the power mean or generalized mean).

Parameters:
  • a (numpy array or list) – array of values

  • rho (float) – Hölder parameter:

    arithmetic mean (rho=1)

    geometric mean (rho=0)

    harmonic mean (rho=-1)

    quadratic mean (rho=2)

    max (rho -> infinity)

    min (rho -> -infinity)
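For reference, the Hölder mean of positive values a_1..a_n is ((a_1^rho + ... + a_n^rho) / n)^(1/rho), with the geometric mean as the limit rho -> 0. The sketch below implements that textbook formula with the standard library only. The name `power_mean` is hypothetical, and the library's handling of edge cases such as zeros or infinite rho is not shown here.

```python
import math

def power_mean(values, rho=1.0):
    """Hölder (power) mean of positive values; rho=0 is the geometric-mean limit."""
    n = len(values)
    if rho == 0:
        # Limit rho -> 0: exponential of the mean logarithm.
        return math.exp(sum(math.log(v) for v in values) / n)
    return (sum(v ** rho for v in values) / n) ** (1.0 / rho)

values = [1.0, 2.0, 4.0]
arithmetic = power_mean(values, 1)   # (1 + 2 + 4) / 3
geometric = power_mean(values, 0)    # (1 * 2 * 4) ** (1/3)
harmonic = power_mean(values, -1)    # 3 / (1/1 + 1/2 + 1/4)
quadratic = power_mean(values, 2)    # sqrt((1 + 4 + 16) / 3)
```

For any fixed set of positive values, the mean is non-decreasing in rho, so harmonic <= geometric <= arithmetic <= quadratic.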

pyscisci.utils.isin_range(values2check, min_value, max_value)

Check if the values2check are in the inclusive range [min_value, max_value].

Parameters:
  • values2check (numpy array) – The values to check.

  • min_value (float) – The lowest value of the range.

  • max_value (float) – The highest value of the range.

Returns:

True if the value is in the range.

Return type:

Numpy Array

pyscisci.utils.isin_sorted(values2check, masterlist)

Check if the values2check are in the sorted masterlist.

Parameters:
  • values2check (numpy array) – The values to check.

  • masterlist (numpy array) – The sorted list of master values.

Returns:

True if the value is in the masterlist.

Return type:

Numpy Array
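When the master list is already sorted, membership can be tested by binary search instead of a full scan. The sketch below shows one way to do this with `numpy.searchsorted`. It illustrates the technique and is not necessarily how the library implements it; `isin_sorted_sketch` is a hypothetical name.

```python
import numpy as np

def isin_sorted_sketch(values2check, masterlist):
    """Membership test by binary search; masterlist must be sorted ascending."""
    values2check = np.asarray(values2check)
    masterlist = np.asarray(masterlist)
    if masterlist.size == 0:
        return np.zeros(values2check.shape, dtype=bool)
    # Index of the first master element >= each value.
    idx = np.searchsorted(masterlist, values2check, side="left")
    # Values above every master element get idx == len(masterlist); clip to stay in bounds.
    idx = np.clip(idx, 0, masterlist.size - 1)
    return masterlist[idx] == values2check

result = isin_sorted_sketch(np.array([3, 4, 10, 1]), np.array([1, 3, 5, 7]))
```

Here 3 and 1 are found in the master list, while 4 and 10 are not.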

pyscisci.utils.jenson_shannon(p, q)

Jensen–Shannon divergence

pyscisci.utils.kl(p, q)

Kullback–Leibler divergence (KL-divergence)
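The two divergences are related: the Jensen–Shannon divergence is the average KL divergence of p and q to their midpoint distribution m = (p + q) / 2. The sketch below uses the standard textbook definitions with natural logarithms. The function names are hypothetical, and the library's log base and treatment of zero probabilities are not documented here, so both are assumptions.

```python
import math

def kl_divergence(p, q):
    """KL(p || q) in nats; terms with p_i == 0 contribute nothing."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def js_divergence(p, q):
    """Jensen-Shannon divergence: average KL divergence to the midpoint distribution."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)

p = [0.5, 0.5, 0.0]
q = [0.25, 0.25, 0.5]

kl_pq = kl_divergence(p, q)   # equals ln(2) for these inputs
js_pq = js_divergence(p, q)
js_qp = js_divergence(q, p)   # same value: JS is symmetric
```

KL(q || p) is undefined for these inputs because q puts mass where p is zero. The Jensen–Shannon divergence stays finite because the midpoint m is positive wherever p or q is.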

pyscisci.utils.rank_array(a, ascending=True, normed=False)

Rank elements in the array. On the normed scale, ascending gives lowest=0, highest=1; descending gives lowest=1, highest=0.

Parameters:
  • a (numpy array or list) – Object to rank

  • ascending (bool, default True) – Sort ascending vs. descending.

  • normed (bool, default False) – If False, ranks run from 0 to N-1. If True, ranks run from 0 to 1.

Returns:

Ranked array.
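The ranking conventions can be sketched with `numpy.argsort`. This is an illustration, not the library's code: `rank_sketch` is a hypothetical name, and the tie-breaking rule used here (ties ranked by position) is an assumption, since the documentation does not state how ties are handled.

```python
import numpy as np

def rank_sketch(a, ascending=True, normed=False):
    """Rank from 0 to N-1 (or 0 to 1 if normed); ties are broken by position."""
    a = np.asarray(a)
    order = np.argsort(a, kind="stable")
    if not ascending:
        order = order[::-1]
    ranks = np.empty(a.size, dtype=float)
    # The element at sorted position k receives rank k.
    ranks[order] = np.arange(a.size)
    if normed and a.size > 1:
        ranks /= (a.size - 1)
    return ranks

asc = rank_sketch([30, 10, 20])
desc = rank_sketch([30, 10, 20], ascending=False)
asc_normed = rank_sketch([30, 10, 20], normed=True)
```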

pyscisci.utils.shannon_entropy(p, base=2.718281828459045)

Shannon entropy

Parameters:

p (numpy array or list) – array of values
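The default `base` in the signature, 2.718281828459045, is Euler's number e, so the default result is in nats. The following standard-library sketch shows the textbook formula H = -sum(p_i * log(p_i)). Normalizing the input and skipping zero entries are assumptions of the sketch, not documented library behavior, and `entropy_sketch` is a hypothetical name.

```python
import math

def entropy_sketch(p, base=math.e):
    """Shannon entropy of non-negative weights p, normalized to sum to 1."""
    total = sum(p)
    probs = [x / total for x in p if x > 0]
    return -sum(pi * math.log(pi, base) for pi in probs)

uniform_bits = entropy_sketch([1, 1, 1, 1], base=2)   # four equally likely outcomes
coin_nats = entropy_sketch([0.5, 0.5])                # fair coin, natural log
certain = entropy_sketch([1, 0])                      # no uncertainty
```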

pyscisci.utils.simpson(p)

Simpson diversity

Parameters:

p (numpy array or list) – array of values

pyscisci.utils.simpson_finite(p)

Simpson diversity with finite size correction

Parameters:

p (numpy array or list) – array of values
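Several conventions exist for Simpson diversity, and this page does not say which one the library uses. As a hedged illustration, the sketch below shows the Gini–Simpson form, 1 - sum(p_i^2), and the finite-sample estimator computed from integer counts, 1 - sum(n_i (n_i - 1)) / (N (N - 1)). Both function names are hypothetical.

```python
def gini_simpson(p):
    """1 - sum(p_i^2) for a probability vector p."""
    return 1.0 - sum(pi * pi for pi in p)

def gini_simpson_finite(counts):
    """Finite-sample version from integer counts n_i, with N = sum(n_i)."""
    n_total = sum(counts)
    if n_total < 2:
        return 0.0
    pairs_same = sum(n * (n - 1) for n in counts)
    return 1.0 - pairs_same / (n_total * (n_total - 1))

counts = [2, 2]
probs = [n / sum(counts) for n in counts]

plain = gini_simpson(probs)               # 1 - (0.25 + 0.25)
corrected = gini_simpson_finite(counts)   # 1 - 4 / 12
```

The finite-sample version corresponds to drawing two items without replacement, so it differs noticeably from the plain form when the sample is small.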

pyscisci.utils.uniquemap_by_frequency(df, colgroupby='PublicationId', colcountby='FieldId', ascending=False)

Reduce a one-to-many mapping to a single selection per key based on frequency of occurrence in the DataFrame (either the largest, most common value or the smallest, least common value).

Parameters:

ascending (bool, default False) – False: larger counts dominate, so the map defaults to the most common value. True: smaller counts dominate, so the map defaults to the least common value.
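One way to read this reduction: count how often each `colcountby` value occurs in the whole DataFrame, then keep, for every `colgroupby` key, the candidate with the highest (or lowest) count. The pandas sketch below follows that reading. It is an assumption about the semantics, not the library's code, and `unique_map_sketch` and the sample data are hypothetical.

```python
import pandas as pd

def unique_map_sketch(df, colgroupby="PublicationId", colcountby="FieldId", ascending=False):
    """Keep one colcountby value per colgroupby key, chosen by overall frequency."""
    # Frequency of each value across the whole DataFrame, aligned to the rows.
    freq = df[colcountby].map(df[colcountby].value_counts())
    ranked = df.assign(_freq=freq).sort_values("_freq", ascending=ascending)
    # After sorting, the first row seen for each key is the selected one.
    return (
        ranked.drop_duplicates(subset=colgroupby, keep="first")
        .drop(columns="_freq")
        .sort_values(colgroupby)
        .reset_index(drop=True)
    )

# Hypothetical data: field A occurs 3 times, B twice, C once.
df = pd.DataFrame({
    "PublicationId": [1, 1, 2, 3, 3, 4],
    "FieldId": ["A", "B", "A", "B", "C", "A"],
})

most = unique_map_sketch(df)
least = unique_map_sketch(df, ascending=True)
most_common = dict(zip(most["PublicationId"], most["FieldId"]))
least_common = dict(zip(least["PublicationId"], least["FieldId"]))
```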

pyscisci.utils.value_to_int(a, sort_values='value', ascending=False, return_map=True)

Map the values of an array to integers.

Parameters:
  • a (numpy array or list) – array of values

  • sort_values (str, default 'value') –

    ‘none’: don't sort

    ‘value’: sort the array items based on their value

    ‘freq’: sort the array items based on their frequency (see ascending)

  • ascending (bool, default False) –

    Only when sort_values == ‘freq’:

    False: larger counts dominate, so 0 maps to the most common value

    True: smaller counts dominate, so 0 maps to the least common value

  • return_map (bool, default True) –
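The three sorting modes can be sketched with the standard library alone. This is an illustration under stated assumptions, not the library's code: `value_to_int_sketch` is a hypothetical name, ‘none’ is taken to mean order of first appearance, ‘value’ is taken to mean ascending order, and the sketch always returns both the integer codes and the mapping.

```python
from collections import Counter

def value_to_int_sketch(a, sort_values="value", ascending=False):
    """Map each distinct value in a to an integer code; return (codes, mapping)."""
    if sort_values == "value":
        keys = sorted(set(a))
    elif sort_values == "freq":
        counts = Counter(a)
        # ascending=False puts the most common value first, so it maps to 0.
        keys = sorted(counts, key=lambda k: counts[k], reverse=not ascending)
    else:
        # Assumed meaning of 'none': keep order of first appearance.
        keys = list(dict.fromkeys(a))
    mapping = {k: i for i, k in enumerate(keys)}
    return [mapping[v] for v in a], mapping

a = ["b", "a", "b", "c", "b", "a"]

codes_value, map_value = value_to_int_sketch(a, sort_values="value")
codes_freq, map_freq = value_to_int_sketch(a, sort_values="freq")
codes_rare, map_rare = value_to_int_sketch(a, sort_values="freq", ascending=True)
```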

pyscisci.utils.welford_mean_m2(previous_count, previous_mean, previous_m2, new_value)

Welford’s algorithm for online mean and variance.
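Welford's algorithm updates a running count, mean, and sum of squared deviations (M2) one value at a time, so the variance is available without storing the data. The sketch below shows the standard update step. The argument order mirrors the signature above, but `welford_step` is a hypothetical name and the return tuple is an assumption about the library's function.

```python
def welford_step(previous_count, previous_mean, previous_m2, new_value):
    """One Welford update; returns the new (count, mean, m2)."""
    count = previous_count + 1
    delta = new_value - previous_mean
    mean = previous_mean + delta / count
    # Second delta uses the updated mean; the product accumulates squared deviations.
    delta2 = new_value - mean
    m2 = previous_m2 + delta * delta2
    return count, mean, m2

count, mean, m2 = 0, 0.0, 0.0
for x in [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]:
    count, mean, m2 = welford_step(count, mean, m2, x)

population_variance = m2 / count
sample_variance = m2 / (count - 1)
```

For this data the running mean ends at 5 and M2 at 32, giving a population variance of 4.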