In [1]:
import sys; sys.path.append(_dh[0].split("knowknow")[0])
from knowknow import *
In [2]:
showdocs("top1")

Zooming in on the top 1%

I would like to look at the most successful cited authors, cited works, and cited terms. Unfortunately, this isn't so simple. The supply of citations has increased dramatically over the last 100 years, so the group with the most total citations would be skewed towards the citation preferences of recent papers. To account for this bias, I select among items cited by articles published in each rolling decade window: 1940-1950, 1941-1951, 1942-1952, and so on through 1980-1990. (The code below treats both endpoints as inclusive, so each window covers 11 publication years.) Within each window I determine the top-cited 1%. The union of these top-1% sets across all windows makes up the 1% I will study in this paper.
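The selection described above can be sketched on toy data. This is a minimal illustration, not the notebook's actual code: the function name `top_items_by_window`, the `(item, citing_year, count)` tuple layout, and the toy numbers are all assumptions made for the example, and the top fraction is set to 34% only so that a three-item example yields a visible result.

```python
from collections import defaultdict

import numpy as np


def top_items_by_window(citations, first_start, last_start, span, top_fraction):
    """Union of the top-cited items over rolling windows of citing years.

    citations: iterable of (item, citing_year, count) tuples (hypothetical layout).
    Each window runs from start to start + span, both endpoints inclusive.
    """
    selected = set()
    for start in range(first_start, last_start + 1):
        end = start + span
        # total citations each item received from articles in this window
        totals = defaultdict(int)
        for item, year, count in citations:
            if start <= year <= end:
                totals[item] += count
        if not totals:
            continue  # no citing articles in this window
        threshold = np.quantile(list(totals.values()), 1 - top_fraction)
        # keep every item at or above the threshold (ties included)
        selected.update(k for k, v in totals.items() if v >= threshold)
    return selected


# Toy data: "A" dominates early, "B" dominates later with far more citations.
toy = [
    ("A", 1940, 10), ("B", 1940, 1), ("C", 1940, 2),
    ("A", 1960, 1), ("B", 1960, 30), ("C", 1960, 2),
]

# One pooled window over the whole period favors the recent favorite.
pooled = top_items_by_window(toy, 1940, 1940, 20, 0.34)

# Rolling windows also recover the early favorite.
windowed = top_items_by_window(toy, 1940, 1950, 10, 0.34)

print(pooled, windowed)
```

In this toy case the pooled ranking keeps only "B", while the rolling windows keep both "A" and "B", which is the bias the windowing is meant to correct.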

User Parameters

Pick the database (database_name) and the type of count atom to analyze (ctype), e.g. "ta" for cited author or "c" for cited work.

Note that "t" is only available for JSTOR databases.

In [3]:
database_name = 'sociology-wos'
ctype = 'ta'
top_percentile = 0.01
In [4]:
# Parameters
database_name = "sociology-wos"
ctype = "fa"

Load data

In [5]:
cysum = load_variable("%s.%s.ysum" % (database_name,ctype))
cits = get_cnt("%s.doc" % database_name, [comb(ctype,'fy')])
Loaded keys: dict_keys(['fa.fy'])
Available keys: ['a', 'c', 'c.c', 'c.fj', 'c.fy', 'c.fy.j', 'fa', 'fa.c', 'fa.fj', 'fa.fj.fy', 'fa.fy', 'fj', 'fj.fy', 'fj.ta', 'fj.ty', 'fy', 'fy.ta', 'fy.ty', 'ta', 'ty', 'ty.ty']
In [6]:
any("-" in x for x in cysum)
Out[6]:
False

Loop through all the decades

In [7]:
all_tops = set()

print("%s total entries" % len(cysum))


# ranges loop from 1940-1950 to 1980-1990, in 1-year increments
for RANGE_START, RANGE_END in zip( 
    range(1940,1980+1,1),
    range(1950,1990+1,1),
):
    

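    # total citations each item received from articles published in this window
    # (both endpoints inclusive)
    count_in_range = defaultdict(int)
    for cross, count in cits[comb(ctype,'fy')].items():
        if RANGE_START <= cross.fy <= RANGE_END:
            count_in_range[ getattr(cross, ctype) ] += count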
            
    counts = list(count_in_range.values())
    if not len(counts):
        print("Skipping %s" % RANGE_START)
        continue
        
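    # keep every item at or above the (1 - top_percentile) quantile; ties at the threshold are all kept
    q99 = np.quantile(np.array( counts ), 1-top_percentile)
    top1 = {k for k in count_in_range if count_in_range[k]>=q99}
    all_tops.update(top1)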
    
    print("%s /%s in the top %0.1f%% in %s,%s (%s total accumulated)" % (
        len(top1),
        len(count_in_range),
        top_percentile*100,
        RANGE_START, RANGE_END,
        len(all_tops)
    ))

    
alldf = pd.DataFrame.from_records([
    c
    for name, c in cysum.items()
    if name in all_tops
])

alldf.fillna(value=np.nan, inplace=True)

print(alldf.shape)
5376 total entries
7 /560 in the top 1.0% in 1940,1950 (7 total accumulated)
8 /628 in the top 1.0% in 1941,1951 (9 total accumulated)
10 /699 in the top 1.0% in 1942,1952 (12 total accumulated)
8 /766 in the top 1.0% in 1943,1953 (14 total accumulated)
12 /831 in the top 1.0% in 1944,1954 (18 total accumulated)
12 /906 in the top 1.0% in 1945,1955 (21 total accumulated)
13 /968 in the top 1.0% in 1946,1956 (23 total accumulated)
14 /1065 in the top 1.0% in 1947,1957 (25 total accumulated)
15 /1153 in the top 1.0% in 1948,1958 (27 total accumulated)
20 /1250 in the top 1.0% in 1949,1959 (32 total accumulated)
24 /1362 in the top 1.0% in 1950,1960 (37 total accumulated)
16 /1488 in the top 1.0% in 1951,1961 (37 total accumulated)
19 /1580 in the top 1.0% in 1952,1962 (42 total accumulated)
19 /1683 in the top 1.0% in 1953,1963 (45 total accumulated)
22 /1801 in the top 1.0% in 1954,1964 (49 total accumulated)
34 /1895 in the top 1.0% in 1955,1965 (60 total accumulated)
22 /2133 in the top 1.0% in 1956,1966 (60 total accumulated)
24 /2300 in the top 1.0% in 1957,1967 (63 total accumulated)
24 /2382 in the top 1.0% in 1958,1968 (65 total accumulated)
34 /2683 in the top 1.0% in 1959,1969 (74 total accumulated)
36 /2982 in the top 1.0% in 1960,1970 (80 total accumulated)
38 /3333 in the top 1.0% in 1961,1971 (87 total accumulated)
41 /3616 in the top 1.0% in 1962,1972 (90 total accumulated)
47 /3898 in the top 1.0% in 1963,1973 (98 total accumulated)
78 /4254 in the top 1.0% in 1964,1974 (128 total accumulated)
52 /4635 in the top 1.0% in 1965,1975 (130 total accumulated)
54 /5083 in the top 1.0% in 1966,1976 (141 total accumulated)
61 /5493 in the top 1.0% in 1967,1977 (153 total accumulated)
84 /6021 in the top 1.0% in 1968,1978 (174 total accumulated)
69 /6525 in the top 1.0% in 1969,1979 (182 total accumulated)
82 /7009 in the top 1.0% in 1970,1980 (192 total accumulated)
89 /7381 in the top 1.0% in 1971,1981 (203 total accumulated)
98 /7779 in the top 1.0% in 1972,1982 (215 total accumulated)
105 /8169 in the top 1.0% in 1973,1983 (231 total accumulated)
119 /8580 in the top 1.0% in 1974,1984 (252 total accumulated)
90 /8925 in the top 1.0% in 1975,1985 (256 total accumulated)
99 /9276 in the top 1.0% in 1976,1986 (268 total accumulated)
108 /9629 in the top 1.0% in 1977,1987 (280 total accumulated)
105 /9985 in the top 1.0% in 1978,1988 (289 total accumulated)
114 /10316 in the top 1.0% in 1979,1989 (309 total accumulated)
116 /10630 in the top 1.0% in 1980,1990 (321 total accumulated)
(222, 42)
In [8]:
alldf.shape
Out[8]:
(222, 42)
In [9]:
alldf.sort_values("total", ascending=False).head()
Out[9]:
rebirth_5_6 rebirth_2_20 maxcount rebirth_5_4 rebirth_5_5 total rebirth_5_8 rebirth_1_10 death_1 rebirth_2_5 ... maxpropy totalprop rebirth_2_10 rebirth_5_0 rebirth_1_20 rebirth_5_9 rebirth_0_20 first rebirth_5_7 rebirth_1_3
4 NaN NaN 5 NaN NaN 89 NaN NaN NaN NaN ... 1966 0.079724 NaN NaN NaN NaN NaN 1965 NaN NaN
8 NaN NaN 5 NaN NaN 86 NaN NaN NaN NaN ... 1979 0.061659 NaN NaN NaN NaN NaN 1973 NaN NaN
2 NaN NaN 4 NaN NaN 72 NaN NaN NaN NaN ... 1991 0.046492 NaN NaN NaN NaN NaN 1979 NaN NaN
28 NaN NaN 7 NaN NaN 68 NaN NaN NaN NaN ... 1973 0.042180 NaN NaN NaN NaN NaN 1972 NaN NaN
80 NaN NaN 4 NaN NaN 65 NaN NaN NaN NaN ... 1976 0.053456 NaN NaN NaN NaN NaN 1966 NaN NaN

5 rows × 42 columns

In [10]:
save_variable("%s.%s.top1" % (database_name,ctype), alldf)