Bibliometrics
pySciSci facilitates the analysis of publications, authors, and citations, including citation time series, fixed time window citation analysis, and citation count normalization by year and field.
Publications and Citations
The pySciSci package facilitates the analysis of interrelationships between publications as captured by references and citations.
For example, the most common measure of scientific impact is the citation count, or the number of times a publication has been referenced by other publications. Variations include citation time series, fixed time window citation analysis, citation count normalization by year and field, and citation ranks. More advanced methods fit models to citation time series, such as in the prediction of the long-term citation counts of a publication [3], or in the assignment of the sleeping beauty score []. The package also supports the removal of self-citations occurring between publications by the same author.
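As a rough illustration of the difference between a total citation count and a fixed time window count, the following plain-Python sketch counts citations from a list of (citing, cited) pairs. It does not use the pySciSci API; the function name `citation_counts`, the `pub_year` mapping, and the choice to treat the window as inclusive are all assumptions made for this example.

```python
from collections import Counter

def citation_counts(citations, pub_year, window=None):
    """Count citations per cited publication.

    citations: iterable of (citing_id, cited_id) pairs.
    pub_year:  dict mapping publication id -> publication year.
    window:    if given, only count citations made at most `window`
               years after the cited publication appeared (inclusive;
               this boundary choice is an assumption of the sketch).
    """
    counts = Counter()
    for citing, cited in citations:
        if window is not None:
            age = pub_year[citing] - pub_year[cited]
            if age < 0 or age > window:
                continue  # citation falls outside the time window
        counts[cited] += 1
    return dict(counts)

# Hypothetical toy data.
pub_year = {"A": 2000, "B": 2003, "C": 2012, "D": 2015}
citations = [("B", "A"), ("C", "A"), ("D", "A"), ("D", "B")]

total = citation_counts(citations, pub_year)           # all citations
c10 = citation_counts(citations, pub_year, window=10)  # 10-year window
```

Publications that receive no citation inside the window are simply absent from the result, so a caller that needs explicit zeros would have to fill them in.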
More advanced metrics capture the diversity in the citation interrelationships between publications. These measures include the Rao-Stirling reference interdisciplinarity [], novelty & conventionality [4], and the disruption index [], [].
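To make the disruption index concrete, here is a minimal sketch based on the commonly used definition D = (n_i - n_j) / (n_i + n_j + n_k), where n_i counts publications citing the focal publication but none of its references, n_j counts those citing both, and n_k counts those citing only the references. The function name and input layout are assumptions of this example and are not taken from pySciSci.

```python
def disruption_index(focal, citations):
    """Disruption index of `focal` from (citing_id, cited_id) pairs.

    Returns None when no publication cites the focal work or its references.
    """
    # References of the focal publication.
    refs = {cited for citing, cited in citations if citing == focal}
    # Publications citing the focal publication.
    cites_focal = {citing for citing, cited in citations if cited == focal}
    # Publications (other than the focal one) citing any of its references.
    cites_refs = {citing for citing, cited in citations
                  if cited in refs and citing != focal}

    n_i = len(cites_focal - cites_refs)  # cite the focal work only
    n_j = len(cites_focal & cites_refs)  # cite the focal work and its references
    n_k = len(cites_refs - cites_focal)  # cite the references only
    total = n_i + n_j + n_k
    return (n_i - n_j) / total if total else None

# Hypothetical toy citation network around focal publication "F".
citations = [
    ("F", "R1"), ("F", "R2"),    # F's reference list
    ("P1", "F"),                 # cites F only
    ("P2", "F"), ("P2", "R1"),   # cites F and one of its references
    ("P3", "R2"),                # cites a reference only
    ("P4", "F"),                 # cites F only
]
d = disruption_index("F", citations)
```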
-
pyscisci.methods.publication.citation_rank(df, colgroupby='Year', colrankby='C10', ascending=True, normed=False, show_progress=False)
Rank publications by their number of citations, from 0 (smallest) to N - 1 (largest).
- Parameters:
df (DataFrame) – A DataFrame with the citation information for each Publication.
colgroupby (str, list) – The DataFrame column(s) to subset by.
colrankby (str) – The DataFrame column to rank by.
ascending (bool, default True) – Sort ascending vs. descending.
normed (bool, default False) –
show_progress (bool, default False) – If True, show a progress bar tracking the calculation.
- Returns:
The original DataFrame with a new column for the rank, named colrankby + ‘Rank’.
- Return type:
DataFrame
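The grouping-then-ranking idea can be sketched in plain Python as follows. This is only one plausible reading of the behaviour described above: it assigns ordinal ranks starting at 0 within each group and breaks ties by input order, which may differ from how pySciSci treats ties. The `normed` option is not modelled, and the helper name `rank_within_groups` is hypothetical.

```python
from collections import defaultdict

def rank_within_groups(records, group_key="Year", rank_key="C10", ascending=True):
    """Return copies of `records` with an added rank_key + 'Rank' entry.

    records: list of dicts, one per publication.
    Ranks run from 0 to N - 1 inside each group (assumption: ordinal
    ranks, ties broken by input order).
    """
    groups = defaultdict(list)
    for idx, rec in enumerate(records):
        groups[rec[group_key]].append(idx)

    ranks = [None] * len(records)
    for idxs in groups.values():
        # sorted() is stable, so tied publications keep their input order.
        ordered = sorted(idxs, key=lambda i: records[i][rank_key],
                         reverse=not ascending)
        for rank, i in enumerate(ordered):
            ranks[i] = rank

    return [dict(rec, **{rank_key + "Rank": ranks[i]})
            for i, rec in enumerate(records)]

# Hypothetical toy data.
pubs = [
    {"PublicationId": 1, "Year": 2000, "C10": 5},
    {"PublicationId": 2, "Year": 2000, "C10": 1},
    {"PublicationId": 3, "Year": 2001, "C10": 7},
    {"PublicationId": 4, "Year": 2000, "C10": 9},
]
ranked = rank_within_groups(pubs)
```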
-
pyscisci.methods.publication.publication_beauty(pub2ref, colgroupby='CitedPublicationId', colcountby='CitingPublicationId', show_progress=False)
Calculate the sleeping beauty coefficient and awakening time for each cited publication. See [] for the derivation.
The algorithmic implementation can be found in metrics.qfactor().
- Parameters:
pub2ref (DataFrame, default None, Optional) – A DataFrame with the temporal citing information.
colgroupby (str, default 'CitedPublicationId', Optional) – The DataFrame column with the cited Publication Ids. If None then the database ‘CitedPublicationId’ is used.
colcountby (str, default 'CitingPublicationId', Optional) – The DataFrame column with the citing Publication Ids. If None then the database ‘CitingPublicationId’ is used.
- Returns:
DataFrame with the sleeping beauty coefficient and awakening time for each cited publication.
- Return type:
DataFrame
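For orientation, the sleeping beauty coefficient and awakening time are usually defined on a publication's yearly citation series c_0, ..., c_T by comparing it with the straight line from (0, c_0) to the peak (t_m, c_tm). The sketch below follows that standard definition in plain Python. It is not the pySciSci implementation; the function name and the convention of returning (0.0, 0) when the peak falls in the first year are assumptions.

```python
import math

def beauty_and_awakening(yearly_citations):
    """Return (beauty coefficient, awakening time) for a non-empty series.

    yearly_citations[t] is the number of citations received t years
    after publication.
    """
    c = list(yearly_citations)
    # Year of maximum yearly citations (first occurrence if tied).
    t_m = max(range(len(c)), key=lambda t: c[t])
    if t_m == 0:
        return 0.0, 0  # assumption: no reference line can be drawn

    slope = (c[t_m] - c[0]) / t_m
    # Sum the normalized gaps between the reference line and the series.
    beauty = sum((slope * t + c[0] - c[t]) / max(1, c[t])
                 for t in range(t_m + 1))

    # Awakening time: the year whose point lies farthest from the line.
    norm = math.sqrt((c[t_m] - c[0]) ** 2 + t_m ** 2)
    dist = [abs((c[t_m] - c[0]) * t - t_m * c[t] + t_m * c[0]) / norm
            for t in range(t_m + 1)]
    t_a = max(range(t_m + 1), key=lambda t: dist[t])
    return beauty, t_a

# A publication ignored for four years and then cited ten times.
b, t_a = beauty_and_awakening([0, 0, 0, 0, 10])
```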
Author-centric Methods
The sociology of science has analyzed scientific careers in terms of individual incentives, productivity, competition, collaboration, and success. The pySciSci package facilitates author career analysis through both aggregate career statistics and temporal career trajectories. Highlights include the H-index [5], Q-factor [], yearly productivity trajectories [6], collective credit assignment [], and hot-hand effect [].
-
pyscisci.methods.author.author_career_length(pub2author=None, colgroupby='AuthorId', datecol='Year', show_progress=False)
Calculate the career length for each author. The career length is the length of time from the first
publication to the last publication.
- Parameters:
pub2author (DataFrame, default None, Optional) – A DataFrame with the author2publication information.
colgroupby (str, default 'AuthorId', Optional) – The DataFrame column with Author Ids. If None then the database ‘AuthorId’ is used.
datecol (str, default 'Year', Optional) – The DataFrame column with Date information. If None then the database ‘Year’ is used.
- Returns:
Productivity DataFrame with 2 columns: ‘AuthorId’, ‘CareerLength’
- Return type:
DataFrame
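The three related career-span quantities (first year, last year, and career length) can be sketched together in plain Python. This is not the pySciSci code: the function name `career_span` is hypothetical, and the sketch assumes the career length is the last year minus the first year, so an author with a single publication year has length 0.

```python
def career_span(pub2author):
    """Map each author to (start year, end year, career length).

    pub2author: iterable of (author_id, year) pairs, one per authorship.
    """
    first, last = {}, {}
    for author, year in pub2author:
        first[author] = min(year, first.get(author, year))
        last[author] = max(year, last.get(author, year))
    # Assumption: career length = last year - first year.
    return {a: (first[a], last[a], last[a] - first[a]) for a in first}

# Hypothetical toy data.
records = [("a1", 2001), ("a1", 1995), ("a1", 2010), ("a2", 2008)]
spans = career_span(records)
```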
-
pyscisci.methods.author.author_cindex(pub2author, impact=None, colgroupby='AuthorId', colcountby='Ctotal', show_progress=False)
Calculate the author c-index. See [7] for the derivation.
The c-index is the number of citations received by an author’s most cited work.
- Parameters:
pub2author (DataFrame, default None, Optional) – A DataFrame with the author2publication information. If None then the database ‘author2pub’ is used.
colgroupby (str, default 'AuthorId', Optional) – The DataFrame column with Author Ids. If None then the database ‘AuthorId’ is used.
colcountby (str, default 'Ctotal', Optional) – The DataFrame column with Citation counts for each publication. If None then the database ‘Ctotal’ is used.
- Returns:
DataFrame with 2 columns: ‘AuthorId’ and the author’s c-index.
- Return type:
DataFrame
-
pyscisci.methods.author.author_endyear(pub2author=None, colgroupby='AuthorId', datecol='Year', show_progress=False)
Calculate the year of last publication for each author.
- Parameters:
pub2author (DataFrame, default None, Optional) – A DataFrame with the author2publication information.
colgroupby (str, default 'AuthorId', Optional) – The DataFrame column with Author Ids. If None then the database ‘AuthorId’ is used.
datecol (str, default 'Year', Optional) – The DataFrame column with Date information. If None then the database ‘Year’ is used.
- Returns:
DataFrame with 2 columns: ‘AuthorId’ and the year of the author’s last publication.
- Return type:
DataFrame
-
pyscisci.methods.author.author_gindex(pub2author, impact=None, colgroupby='AuthorId', colcountby='Ctotal', show_progress=False)
Calculate the author g-index. See [5] for the derivation.
The algorithmic implementation can be found in metrics.hindex().
- Parameters:
pub2author (DataFrame, default None, Optional) – A DataFrame with the author2publication information. If None then the database ‘author2pub’ is used.
colgroupby (str, default 'AuthorId', Optional) – The DataFrame column with Author Ids. If None then the database ‘AuthorId’ is used.
colcountby (str, default 'Ctotal', Optional) – The DataFrame column with Citation counts for each publication. If None then the database ‘Ctotal’ is used.
- Returns:
DataFrame with 2 columns: ‘AuthorId’ and the author’s g-index.
- Return type:
DataFrame
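As a reminder of what the g-index measures, the sketch below applies the usual definition: the largest g such that an author's g most cited publications together have at least g squared citations. It is a plain-Python illustration, not the pySciSci implementation, and it follows the common convention of capping g at the number of publications.

```python
def g_index(citation_counts):
    """g-index for one author's list of per-publication citation counts."""
    ranked = sorted(citation_counts, reverse=True)
    total, g = 0, 0
    for rank, count in enumerate(ranked, start=1):
        total += count
        # The top `rank` publications must hold at least rank^2 citations.
        if total >= rank * rank:
            g = rank
    return g

# Cumulative citations 10, 15, 18, 19, 19 against 1, 4, 9, 16, 25.
g = g_index([10, 5, 3, 1, 0])
```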
-
pyscisci.methods.author.author_hindex(pub2author, impact=None, colgroupby='AuthorId', colcountby='Ctotal', show_progress=False)
Calculate the author H-index. See [5] for the derivation.
The algorithmic implementation can be found in metrics.hindex().
- Parameters:
pub2author (DataFrame, default None, Optional) – A DataFrame with the author2publication information. If None then the database ‘author2pub’ is used.
colgroupby (str, default 'AuthorId', Optional) – The DataFrame column with Author Ids. If None then the database ‘AuthorId’ is used.
colcountby (str, default 'Ctotal', Optional) – The DataFrame column with Citation counts for each publication. If None then the database ‘Ctotal’ is used.
- Returns:
Trajectory DataFrame with 2 columns: ‘AuthorId’, ‘Hindex’
- Return type:
DataFrame
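The H-index itself is simple to state: it is the largest h such that the author has h publications with at least h citations each. The following plain-Python sketch shows that definition for a single author; it is an illustration only and does not reproduce the grouped DataFrame computation in pySciSci.

```python
def h_index(citation_counts):
    """H-index for one author's list of per-publication citation counts."""
    ranked = sorted(citation_counts, reverse=True)
    h = 0
    for rank, count in enumerate(ranked, start=1):
        if count >= rank:
            h = rank   # the top `rank` publications all have >= rank citations
        else:
            break      # counts only decrease from here on
    return h

# Four publications have at least four citations each.
h = h_index([10, 8, 5, 4, 3])
```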
-
pyscisci.methods.author.author_hotstreak(pub2author, colgroupby='AuthorId', citecol='c10', datecol='Year', maxk=1, l1_lambda=1.0, show_progress=False)
Identify hot streaks in author careers [liu2018hotstreak].
TODO: this is an integer programming problem. Reimplement using an integer solver.
Right now it just uses a brute force search (very inefficient)!
- Parameters:
pub2author (DataFrame) – The author publication history for all authors.
colgroupby (str, default 'AuthorId') – The column with Author information.
citecol (str, default 'c10') – The column with publication citation information.
datecol (str, default 'Year') – The column with publication date/year information.
maxk (int, default 1) – The maximum number of hot streaks to search for in a career. Should be 1 or 2.
l1_lambda (float, default 1.0) – The l1 regularization for the number of streaks.
Note, the authors never define the value they used for this in the SI.
- Returns:
lsm_err (float) – The least square mean error of the model plus the l1-regularized term for the number of model coefficients.
streak_loc (array) – The index locations for the hot streak start and end locations.
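To give a feel for the brute force search mentioned above, here is a heavily simplified sketch for the single-streak case. It assumes a piecewise-constant model (one baseline level outside the streak, one level inside it), scores each candidate by mean squared error plus `l1_lambda` per streak, and compares against a no-streak model. The function name, the half-open index convention, and the exact form of the objective are assumptions of this example, not the pySciSci implementation.

```python
def best_single_streak(impacts, l1_lambda=1.0):
    """Exhaustively search for one hot streak in a non-empty impact sequence.

    Returns (error, (start, end)) with a half-open interval [start, end),
    or (error, None) when the no-streak model scores best.
    """
    impacts = list(impacts)
    n = len(impacts)

    def sse(values):
        # Sum of squared deviations from the mean of `values`.
        if not values:
            return 0.0
        mean = sum(values) / len(values)
        return sum((v - mean) ** 2 for v in values)

    # Baseline: a single constant level and no regularization penalty.
    best_err, best_loc = sse(impacts) / n, None
    for start in range(n):
        for end in range(start + 1, n + 1):
            if end - start == n:
                continue  # a streak covering everything is the baseline
            inside = impacts[start:end]
            outside = impacts[:start] + impacts[end:]
            err = (sse(inside) + sse(outside)) / n + l1_lambda
            if err < best_err:
                best_err, best_loc = err, (start, end)
    return best_err, best_loc

# Hypothetical career with a clear burst in positions 3 to 5.
err, loc = best_single_streak([1, 1, 1, 5, 5, 5, 1, 1])
```

The double loop over start and end positions is quadratic in career length, and recomputing the errors makes it cubic overall, which illustrates why an integer programming formulation would be preferable.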
-
pyscisci.methods.author.author_productivity(pub2author=None, colgroupby='AuthorId', colcountby='PublicationId', show_progress=False)
Calculate the total number of publications for each author.
- Parameters:
pub2author (DataFrame, default None, Optional) – A DataFrame with the author2publication information.
colgroupby (str, default 'AuthorId', Optional) – The DataFrame column with Author Ids. If None then the database ‘AuthorId’ is used.
colcountby (str, default 'PublicationId', Optional) – The DataFrame column with Publication Ids. If None then the database ‘PublicationId’ is used.
- Returns:
Productivity DataFrame with 2 columns: ‘AuthorId’, ‘Productivity’
- Return type:
DataFrame
-
pyscisci.methods.author.author_productivity_trajectory(pub2author, colgroupby='AuthorId', datecol='Year', colcountby='PublicationId', show_progress=False)
Calculate the author yearly productivity trajectory. See [6].
The algorithmic implementation can be found in metrics.compute_yearly_productivity_traj().
- Parameters:
pub2author (DataFrame, default None) – A DataFrame with the author2publication information.
colgroupby (str, default 'AuthorId') – The DataFrame column with Author Ids. If None then the database ‘AuthorId’ is used.
datecol (str, default 'Year') – The DataFrame column with Date information. If None then the database ‘Year’ is used.
colcountby (str, default 'PublicationId') – The DataFrame column with Publication Ids. If None then the database ‘PublicationId’ is used.
- Returns:
Trajectory DataFrame with 5 columns: ‘AuthorId’, ‘t_break’, ‘b’, ‘m1’, ‘m2’
- Return type:
DataFrame
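The returned columns suggest a two-segment piecewise linear model of yearly productivity. Assuming that ‘b’ is the intercept, ‘m1’ the slope before the break, ‘m2’ the slope after it, and ‘t_break’ the career year at which the slope changes (an interpretation of the column names, not something stated above), the fitted curve could be evaluated like this:

```python
def piecewise_linear(t, t_break, b, m1, m2):
    """Evaluate a continuous two-segment linear trajectory at career year t.

    Assumed model: slope m1 up to t_break, slope m2 afterwards,
    with both segments meeting at t_break.
    """
    if t <= t_break:
        return b + m1 * t
    return b + m1 * t_break + m2 * (t - t_break)

# Hypothetical parameters: productivity rises for 5 years, then declines.
trajectory = [piecewise_linear(t, t_break=5, b=1.0, m1=0.5, m2=-0.2)
              for t in range(11)]
```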
-
pyscisci.methods.author.author_qfactor(pub2author, impact=None, colgroupby='AuthorId', colcountby='Ctotal', show_progress=False)
Calculate the author Q-factor. See [] for the derivation.
The algorithmic implementation can be found in metrics.qfactor().
- Parameters:
pub2author (DataFrame, default None, Optional) – A DataFrame with the author2publication information. If None then the database ‘author2pub’ is used.
colgroupby (str, default 'AuthorId', Optional) – The DataFrame column with Author Ids. If None then the database ‘AuthorId’ is used.
colcountby (str, default 'Ctotal', Optional) – The DataFrame column with Citation counts for each publication. If None then the database ‘Ctotal’ is used.
- Returns:
DataFrame with 2 columns: ‘AuthorId’ and the author’s Q-factor.
- Return type:
DataFrame
-
pyscisci.methods.author.author_startyear(pub2author=None, colgroupby='AuthorId', datecol='Year', show_progress=False)
Calculate the year of first publication for each author.
- Parameters:
pub2author (DataFrame, default None, Optional) – A DataFrame with the author2publication information.
colgroupby (str, default 'AuthorId', Optional) – The DataFrame column with Author Ids. If None then the database ‘AuthorId’ is used.
datecol (str, default 'Year', Optional) – The DataFrame column with Date information. If None then the database ‘Year’ is used.
- Returns:
DataFrame with 2 columns: ‘AuthorId’ and the year of the author’s first publication.
- Return type:
DataFrame
-
pyscisci.methods.author.author_top_field(pub2author, colgroupby='AuthorId', colcountby='FieldId', fractional_field_counts=False, show_progress=False)
Calculate the most frequent field in the author’s career.
- Parameters:
pub2author (DataFrame) – A DataFrame with the author2publication field information.
colgroupby (str, default 'AuthorId') – The DataFrame column with Author Ids. If None then the database ‘AuthorId’ is used.
colcountby (str, default 'FieldId') – The DataFrame column with Field Ids. If None then the database ‘FieldId’ is used.
fractional_field_counts (bool, default False) – How to count publications that are assigned to multiple fields. If False, each publication-field assignment is counted once. If True, each publication is counted once, contributing 1/#fields to each field.
- Returns:
DataFrame with 2 columns: ‘AuthorId’, ‘TopFieldId’
- Return type:
DataFrame
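The effect of `fractional_field_counts` is easiest to see on a small example. The plain-Python sketch below mirrors the two counting rules described above; the function name, the tuple layout, and the tie-breaking rule (smallest field id wins) are assumptions of this illustration rather than pySciSci behaviour.

```python
from collections import defaultdict

def top_field(records, fractional_field_counts=False):
    """Map each author to their most frequent field.

    records: iterable of (author_id, publication_id, field_id) tuples.
    """
    records = set(records)  # ignore duplicated assignments

    # Number of distinct fields assigned to each publication.
    fields_per_pub = defaultdict(set)
    for _, pub, field in records:
        fields_per_pub[pub].add(field)

    weights = defaultdict(lambda: defaultdict(float))
    for author, pub, field in records:
        if fractional_field_counts:
            w = 1.0 / len(fields_per_pub[pub])  # publication split across fields
        else:
            w = 1.0                             # every assignment counts once
        weights[author][field] += w

    # Assumption: ties go to the smallest field id.
    return {author: max(sorted(fw), key=fw.get)
            for author, fw in weights.items()}

# Hypothetical author with three two-field papers and two single-field papers.
records = [
    ("a1", "p1", "A"), ("a1", "p1", "B"),
    ("a1", "p2", "A"), ("a1", "p2", "C"),
    ("a1", "p3", "A"), ("a1", "p3", "D"),
    ("a1", "p4", "E"), ("a1", "p5", "E"),
]
whole = top_field(records)
fractional = top_field(records, fractional_field_counts=True)
```

With whole counts, field A collects 3 assignments against 2 for E. With fractional counts, A collects 1.5 against 2.0 for E, so the top field changes.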
-
pyscisci.methods.author.author_yearly_productivity(pub2author=None, colgroupby='AuthorId', datecol='Year', colcountby='PublicationId', show_progress=False)
Calculate the number of publications for each author in each year.
- Parameters:
pub2author (DataFrame, default None, Optional) – A DataFrame with the author2publication information.
colgroupby (str, default 'AuthorId', Optional) – The DataFrame column with Author Ids. If None then the database ‘AuthorId’ is used.
datecol (str, default 'Year', Optional) – The DataFrame column with Year information. If None then the database ‘Year’ is used.
colcountby (str, default 'PublicationId', Optional) – The DataFrame column with Publication Ids. If None then the database ‘PublicationId’ is used.
- Returns:
Productivity DataFrame with 3 columns: ‘AuthorId’, ‘Year’, ‘YearlyProductivity’
- Return type:
DataFrame
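Yearly and total productivity are both counts of publications per author. A minimal plain-Python sketch, assuming each record is an (author, publication, year) tuple and that a publication listed twice for the same author should count once (both are assumptions of this example, not pySciSci behaviour):

```python
from collections import defaultdict

def yearly_productivity(pub2author):
    """Map (author_id, year) to the number of distinct publications."""
    pubs = defaultdict(set)
    for author, pub, year in pub2author:
        pubs[(author, year)].add(pub)
    return {key: len(ids) for key, ids in pubs.items()}

def total_productivity(pub2author):
    """Map author_id to the number of distinct publications overall."""
    pubs = defaultdict(set)
    for author, pub, _ in pub2author:
        pubs[author].add(pub)
    return {author: len(ids) for author, ids in pubs.items()}

# Hypothetical toy data; the duplicated row for p1 is counted once.
records = [
    ("a1", "p1", 2000), ("a1", "p1", 2000),
    ("a1", "p2", 2000), ("a1", "p3", 2002),
    ("a2", "p3", 2002),
]
per_year = yearly_productivity(records)
overall = total_productivity(records)
```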
-
exception pyscisci.methods.author.pySciSciMetricError
Base Class for metric errors.