In [64]:
import sys; sys.path.append(_dh[0].split("knowknow")[0])
from knowknow import *
In [2]:
showdocs("counter")

Counting coocurrences

Cultural phenomena are rich in meaning and context. Moreover, the meaning and context are what we care about, so stripping that would be a disservice. "Consider Geertz:"

Not only is the semantic structure of the figure a good deal more complex than it appears on the surface, but an analysis of that structure forces one into tracing a multiplicity of referential connections between it and social reality, so that the final picture is one of a configuration of dissimilar meanings out of whose interworking both the expressive power and the rhetorical force of the final symbol derive. (Geertz [1955] 1973, Chapter 8 Ideology as a Cultural System, p. 213)

The way people understanding their world shape their action, and understandings are heterogeneous in any community, woven into a complex web of interacting pieces and parts. Understandings are constantly evolving, shifting with every conversation or Breaking News. Any quantitative technique for studying meaning must be able to capture the relational structure of cultural objects, their temporal dynamics, or it cannot be meaning.

These considerations motivate how I have designed the data structure and code for this project. My attention to "cooccurrences" in what follows is an application of Levi Martin and Lee's (2018) formal approach to meaning. They develop the symbolic formalism I use below, as well as showing several general analytic strategies for inductive, ground-up meaning-making from count data. This approach is quite general, useful for many applications.

The process is rather simple, I count cooccurrences between various attributes. For each document, for each citation in that document, I increment a dozen counters, depending on attributes of the citation, paper, journal, or author. This counting process is done once, and can be used as a compressed form of the dataset for all further analyses. In the terminology of Levi Martin and Lee, I am constructing "hypergraphs", and I will use their notation in what follows. For example $[c*fy]$ indicates the dataset which maps from $(c, fy) \to count$. $c$ is the name of the cited work. $fy$ is the publication year of the article which made the citation. $count$ is the number of citations which are at the intersection of these properties.

  • $[c]$ the number of citations each document receives
  • $[c*fj]$ the number of citations each document receives from each journal's articles
  • $[c*fy]$ the number of citations each document receives from each year's articles
  • $[fj]$ the number of citations from each journal
  • $[fj*fy]$ the number of citations in each journal in each year
  • $[t]$ cited term total counts
  • $[fy*t]$ cited term time series
  • term cooccurrence with citation and journal ($[c*t]$ and [fj*t]$)
  • "author" counts, the number of citations by each author ($[a]$ $[a*c]$ $[a*j*y]$)
  • [c*c]$, the cooccurrence network between citations
  • the death of citations can be studied using the $[c*fy]$ hypergraph
  • $[c*fj*t]$ could be used for analyzing differential associations of $c$ to $t$ across publication venues
  • $[ta*ta]$, $[fa*fa]$, $[t*t]$ and $[c*c]$ open the door to network-scientific methods

References

  • Martin, John Levi, and Monica Lee. 2018. “A Formal Approach to Meaning.” Poetics 68(February):10–17.
  • Geertz, Clifford. 1973. The Interpretation of Cultures. New York: Basic Books, Inc.

How to retrieve the requisite data

  • This notebook generates cooccurrence counts given Web of Science output.
  • No Web of Science output is included in knowknow in compliance with Web of Science terms of use.
  • This notebook is only useful if you have your own Web of Science output, and want to run analyses on that.

Data can be downloaded from any Web of Science search results page:

  1. Export -> Other File Formats.
  2. In the dropdowns, specify Record Content -> Full Record and Cited References and File Format -> Tab Delimited (Win)
  3. Type the folder containing the saved records as .txt files into the variable wos_base below.

Instructions

  1. Edit the parameters to fit your needs
  2. Run all cells

Parameters

In [3]:
#wos_base = "path/to/wos/data"
wos_base = "G:/My Drive/projects/qualitative analysis of literature/pre 5-12-2020/009 get everything from WOS"


#journal_map = {
#    "journal of social forces": "social forces",\
#    "studies in symbolic interaction*": "studies in symbolic interaction"    
#}

name_blacklist = [
    "*us", 'Press', 'Knopf', '(January', 'Co', 'London', 'Bros', 'Books', 'Wilson','[anonymous]'
]


debug = False
database_name = "sociology-wos"

RUN_PREGROUP = True

RUN_EVERYTHING = True


use_included_citations_filter = False
if use_included_citations_filter:
    included_citations = load_variable("%s-all/included_citations" % database_name)
    
    
use_included_journals_filter = True
journal_keep = ["ETHNIC AND RACIAL STUDIES", "LAW & SOCIETY REVIEW", "DISCOURSE & SOCIETY", "SOCIOLOGICAL INQUIRY", "CONTRIBUTIONS TO INDIAN SOCIOLOGY", "SOCIETY & NATURAL RESOURCES", "RATIONALITY AND SOCIETY", "DEVIANT BEHAVIOR", "ACTA SOCIOLOGICA", "SOCIOLOGY-THE JOURNAL OF THE BRITISH SOCIOLOGICAL ASSOCIATION", "WORK EMPLOYMENT AND SOCIETY", "SOCIOLOGICAL METHODS & RESEARCH", "SOCIOLOGICAL PERSPECTIVES", "JOURNAL OF MARRIAGE AND FAMILY", "WORK AND OCCUPATIONS", "JOURNAL OF CONTEMPORARY ETHNOGRAPHY", "THEORY AND SOCIETY", "POLITICS & SOCIETY", "SOCIOLOGICAL SPECTRUM", "RACE & CLASS", "ANTHROZOOS", "LEISURE SCIENCES", "COMPARATIVE STUDIES IN SOCIETY AND HISTORY", "SOCIAL SCIENCE QUARTERLY", "MEDIA CULTURE & SOCIETY", "SOCIOLOGY OF HEALTH & ILLNESS", "SOCIOLOGIA RURALIS", "SOCIOLOGICAL REVIEW", "TEACHING SOCIOLOGY", "BRITISH JOURNAL OF SOCIOLOGY", "JOURNAL OF THE HISTORY OF SEXUALITY", "SOCIOLOGY OF EDUCATION", "SOCIAL NETWORKS", "ARMED FORCES & SOCIETY", "YOUTH & SOCIETY", "POPULATION AND DEVELOPMENT REVIEW", "SOCIETY", "JOURNAL OF HISTORICAL SOCIOLOGY", "HUMAN ECOLOGY", "INTERNATIONAL SOCIOLOGY", "SOCIAL FORCES", "EUROPEAN SOCIOLOGICAL REVIEW", "JOURNAL OF HEALTH AND SOCIAL BEHAVIOR", "SOCIOLOGICAL THEORY", "SOCIAL INDICATORS RESEARCH", "POETICS", "HUMAN STUDIES", "SOCIOLOGICAL FORUM", "AMERICAN SOCIOLOGICAL REVIEW", "SOCIOLOGY OF SPORT JOURNAL", "SOCIOLOGY OF RELIGION", "JOURNAL OF LAW AND SOCIETY", "GENDER & SOCIETY", "BRITISH JOURNAL OF SOCIOLOGY OF EDUCATION", "LANGUAGE IN SOCIETY", "AMERICAN JOURNAL OF ECONOMICS AND SOCIOLOGY", "ANNALS OF TOURISM RESEARCH", "SOCIAL PROBLEMS", "INTERNATIONAL JOURNAL OF INTERCULTURAL RELATIONS", "SOCIAL SCIENCE RESEARCH", "SYMBOLIC INTERACTION", "JOURNAL OF LEISURE RESEARCH", "ECONOMY AND SOCIETY", "SOCIAL COMPASS", "SOCIOLOGICAL QUARTERLY", "JOURNAL OF MATHEMATICAL SOCIOLOGY", "AMERICAN JOURNAL OF SOCIOLOGY", "REVIEW OF RELIGIOUS RESEARCH", "RURAL SOCIOLOGY", "JOURNAL FOR THE SCIENTIFIC STUDY OF RELIGION", "ARCHIVES EUROPEENNES DE SOCIOLOGIE", "CANADIAN JOURNAL OF SOCIOLOGY-CAHIERS CANADIENS DE SOCIOLOGIE"]
journal_keep = [x.lower() for x in journal_keep]
In [4]:
groups=None

if RUN_PREGROUP:    
    # loading precomputed groupings of cited books and articles, if the grouping has been generated

    try:
        groups = load_variable("%s-all/groups" % database_name)
        ysumc = load_variable("%s-all/c.ysum" % database_name)
        print("Groups successfully loaded.")
        
        # find the most popular representation
        current_c = get_cnt("%s-all/doc"%database_name, ['c'])
        print("Citations successfully loaded.")

    except VariableNotFound:
        print("Groups don't exist yet. It's important to incorporate a fuzzy match for cited references. "
              "Run this script once, then run 'trend summaries/cysum.ipynb' to generate summaries and filter out small cited works. "
              "Then run 'grouping article and book names.ipynb' to generate groupings. "
              "Then run this notebook again to generate counts for grouped entries. "
              "And finally, to make sure cysum is up to date, run 'trend summaries/cysum.ipynb' one more time. "
             )
        
Groups successfully loaded.
Loaded keys: dict_keys(['c'])
Available keys: ['c', 'c.c', 'c.fj', 'c.fy', 'fj', 'fj.fy', 'fy', 'ty', 'ty.ty']
Citations successfully loaded.

First pass - just counting

In [5]:
from csv import DictReader
In [6]:
dcount = 0
total_inserts = 0
to_inserts = []
basedir = Path(wos_base)

# keeps track of all the years seen for grouped citations
multi_year = defaultdict(lambda: defaultdict(int))
In [7]:
# Instantiating counters
cnt_ind = defaultdict(lambda:defaultdict(int))
track_doc = defaultdict(lambda:defaultdict(set))
cnt_doc = defaultdict(lambda:defaultdict(int))

def cnt(term, space, doc):
    if ".".join(sorted(space.split("."))) != space:
        raise Exception(space, "should be sorted...")
    
    # it's a set, yo
    track_doc[space][term].add(doc)
    # update cnt_doc
    cnt_doc[space][term] = len(track_doc[space][term])
    # update ind count
    cnt_ind[space][term] += 1
In [8]:
# This cell ensures there are not overflow errors while importing large CSVs

import sys
import csv
maxInt = sys.maxsize

while True:
    # decrease the maxInt value by factor 10
    # as long as the OverflowError occurs.

    try:
        csv.field_size_limit(maxInt)
        break
    except OverflowError:
        maxInt = int(maxInt/10)
In [9]:
def fixcitedauth(a):
    a = a.strip()
    if not len(a):
        return None

    sp = a.lower().split()
    if len(sp) < 2:
        return None
    if len(sp) >= 5:
        return None
    
    l, f = a.lower().split()[:2] # take first two words
    
    if len(l) == 1: # sometimes last and first name are switched for some reason
        l, f = f, l
        
    f = f[0] + "." # first initial
    
    a = ", ".join([l, f]) # first initial
    a = a.title() # title format, so I don't have to worry later
    
    if debug:
        print('cited author:', a)
        
    return a


def fix_refs(refs):
    for r in refs:
        yspl = re.split("((?:18|19|20)[0-9]{2})", r)

        if len(yspl) < 2:
            continue

        auth, year = yspl[:2]
        auth = auth.strip()
        
        if len(auth.split(" ")) > 5:
            continue
        
        year = int(year)

        if auth == "":
            continue

        auth = fixcitedauth( auth )
        if auth is None: # catching non-people, and not counting the citations
            continue
        
        full_ref = r
        
        if 'DOI' not in full_ref and not ( # it's a book!
            len(re.findall(r', [Vv][0-9]+', full_ref)) or
            len(re.findall(r'[0-9]+, [Pp]?[0-9]+', full_ref))
        ):
            #full_ref = re.sub(r', [Pp][0-9]+', '', full_ref) # get rid of page numbers!
            
            full_ref = "|".join( # splits off the author and year, and takes until the next comma
                [auth]+
                [x.strip().lower() for x in full_ref.split(",")[2:3]]
            )
        else: # it's an article!
            # just adds a fixed name and date to the front
            full_ref = "|".join(
                [auth, str(year)] +
                [",".join( x.strip() for x in full_ref.lower().split(",")[2:] ).split(",doi")[0]]
                )

        if use_included_citations_filter:
            if full_ref not in included_citations:
                continue
            
        # implement grouping of references
        if groups is not None:
            if full_ref in groups:
                # retrieves retrospectively-computed groups
                full_ref = group_reps[
                    groups[full_ref]
                ]
            elif full_ref in ysumc and '[no title captured]' not in full_ref:
                # a small minority, the ones which are dropped in this process anyways
                #print('not in grouping')
                raise Exception(full_ref, "not in grouping!")
            else:
                continue
                
        if debug:
            print('fix_refs_worked',auth,year,full_ref)
        yield ( auth, year, full_ref )
        

pre-group cited references while counting

In [10]:
if RUN_PREGROUP:
    # loading precomputed groupings of cited books and articles, if the grouping has been generated
    if groups is None:
        print("Groups don't exist yet. It's important to incorporate a fuzzy match for cited references."
              "Run this script once, then run 'trend summaries/cysum.ipynb' to generate summaries and filter out small cited works."
              "Then run 'grouping article and book names.ipynb' to generate groupings."
              "Then run this notebook again to generate counts for grouped entries."
              "And finally, to make sure cysum is up to date, run 'trend summaries/cysum.ipynb' one more time."
             )
        raise Exception("Grouperror")
        

    from collections import defaultdict

    def get_reps(groups):
        ret = defaultdict(set)
        for k,v in groups.items():
            ret[v].add(k)
        ret = {
            k: max(v, key=lambda x:current_c['c'][x])
            for k,v in ret.items()
        }
        return ret

    group_reps = get_reps(groups)
In [11]:
# processes WoS txt output files one by one, counting relevant cooccurrences as it goes
dcount=0
for i, f in enumerate( list(basedir.glob("**/*.txt")) ):

    with f.open(encoding='utf8') as pfile:
        r = DictReader(pfile, delimiter="\t")
        rows = list(r)

    if i % 50 == 0:
        print("File %s/%s: %s" % (i, len(list(basedir.glob("**/*.txt"))),f.name))

    for i, r in enumerate(rows):
        if r['DT'] != "Article":
            continue
            

        #print(refs)

        dcount += 1
        if dcount % 10000 == 0:
            print("Document: %s" % dcount)
            print("Citations: %s" % len(cnt_doc['c']))
            
        if debug:
            print("DOCUMENT %s" % dcount)
            if dcount > 10:
                raise


        refs = r["CR"].strip().split(";")
        refs = list( fix_refs(refs) )

        if not len(refs):
            continue


        def fixcitingauth():
            authors = r['AU'].split(";")
            for x in authors:
                x = x.strip().lower()
                x = re.sub("[^a-zA-Z\s,]+", "", x) # only letters and spaces allowed
                
                xsp = x.split(", ")
                if len(xsp) < 2:
                    return xsp[0]
                elif len(xsp) > 2:
                    raise Exception("author with too many commas", x)
                    
                f, l = xsp[1], xsp[0]
                f = f[0] # take only first initial of first name
                
                yield "%s, %s" % (l,f)
            
        citing_authors = list(fixcitingauth())
        if not len(citing_authors):
            continue
        
        if debug:
            print("citing authors: ", citing_authors)

        if False:
            for i in range(10):
                print("-"*20)


        uid = r['UT']


        try:
            int(r['PY'])
        except ValueError:
            continue

        r['SO'] = r['SO'].lower() # REMEMBER THIS! lol everything is in lowercase... not case sensitive
        
        
        if use_included_journals_filter and r['SO'].lower() not in journal_keep:
            continue


        for (auth,year,full_ref) in refs:
                    
            ref = (auth,year)
           
        
            if ref[0] in name_blacklist:
                continue
            if "*" in ref[0]:
                continue
                
            # BEGIN COUNTING!!!
            
            multi_year[full_ref][year] += 1
                
            cnt(r['SO'], 'fj', uid)
            cnt(int(r['PY']), 'fy', uid)
            cnt(ref[1], 'ty', uid)
            cnt((full_ref, int(r['PY'])), 'c.fy', uid)
            cnt((full_ref, r['SO']), 'c.fj', uid)

            cnt((r['SO'],int(r['PY'])), 'fj.fy', uid)

            cnt(full_ref, 'c', uid)
            
            if not RUN_EVERYTHING:
                continue
                
            cnt((int(r['PY']), year), 'fy.ty', uid)
            cnt((r['SO'], year), 'fj.ty', uid)
                    
            cnt(auth, 'ta', uid)
            cnt((int(r['PY']),auth), 'fy.ta', uid)
            cnt((r['SO'],auth), 'fj.ta', uid)
            
            # first author!
            ffa = citing_authors[0]
            cnt(ffa, 'ffa', uid)
            cnt((ffa,int(r['PY'])), 'ffa.fy', uid)
            cnt((ffa,r['SO']), 'ffa.fj', uid)
            cnt((full_ref,ffa), 'c.ffa', uid)
            #cnt((ffa,r['SO'], int(r['PY'])), 'ffa.fj.fy', uid)

            for a in citing_authors:
                cnt(a, 'fa', uid)
                cnt((a,int(r['PY'])), 'fa.fy', uid)
                cnt((a,r['SO']), 'fa.fj', uid)
                #cnt((a,r['SO'], int(r['PY'])), 'fa.fj.fy', uid)

                cnt((full_ref,a), 'c.fa', uid)
            
            cnt((full_ref, int(r['PY']), r['SO']), 'c.fy.j', uid)
File 0/386: 74501-75000.txt
Document: 10000
Citations: 43016
File 50/386: 3501-4000.txt
Document: 20000
Citations: 67163
Document: 30000
Citations: 75090
File 100/386: 28501-29000.txt
Document: 40000
Citations: 83432
File 150/386: 53501-54000.txt
Document: 50000
Citations: 86586
Document: 60000
Citations: 89273
File 200/386: 4501-5000.txt
Document: 70000
Citations: 95138
Document: 80000
Citations: 100894
File 250/386: 29501-30000.txt
Document: 90000
Citations: 104857
Document: 100000
Citations: 107454
File 300/386: 54501-55000.txt
Document: 110000
Citations: 109413
Document: 120000
Citations: 110411
Document: 130000
Citations: 111036
File 350/386: 79501-80000.txt
Document: 140000
Citations: 111506
In [12]:
r['TI']
Out[12]:
'A SOCIOLOGICAL VIEW OF SOVEREIGNTY'
In [13]:
full_ref
Out[13]:
'Westermarck, E.|hist human marriage'
In [14]:
citing_authors
Out[14]:
['llano, a']

cocitation counter

Because there are so many cited works, a full cocitation network would be prohibitively large to compute and store on disk. Furthermore, the full network is not very useful. The following creates a cocitation network among only the most common 1000 cited works. It counts the number of times cnt['cc'][ (c1,c2) ] that c1 and c2 are cited together in a work. The ind and doc counters are identical for this counter, 'cc'.

In [15]:
if RUN_EVERYTHING:
    allowed_refs = Counter(dict(cnt_ind['c'].items())).most_common(1000)
    allowed_refs = set( x[0] for x in allowed_refs )

    print("# allowed references for cocitation analysis: %s" % len(allowed_refs))
    print("Examples: %s" % str(list(allowed_refs)[:3]))

    # enumerating cocitation for works with at least 10 citations
    dcount = 0
    refcount = 0

    for i, f in enumerate(list(basedir.glob("**/*.txt"))):

        with f.open(encoding='utf8') as pfile:
            r = DictReader(pfile, delimiter="\t")
            rows = list(r)

        if i % 50 == 0:
            print("File %s/%s: %s" % (i, len(list(basedir.glob("**/*.txt"))), f))

        for i, r in enumerate(rows):

            if r['DT'] != "Article":
                continue

            refs = r["CR"].strip().split(";")
            refs = list(fix_refs(refs))

            if not len(refs):
                continue

            uid = r['UT']

            try:
                int(r['PY'])
            except ValueError:
                continue

            for (auth,year,full_ref) in refs:

                if full_ref not in allowed_refs:
                    continue

                for (auth2,year2,full_ref2) in refs:

                    if full_ref2 <= full_ref:
                        continue

                    if full_ref2 not in allowed_refs:
                        continue

                    cnt((full_ref,full_ref2), 'c.c', uid)
                    cnt((year,year2), 'ty.ty', uid)

                    refcount += 1
                    if refcount % 10000 == 0:
                        print("%s cocitations logged" % refcount)

    print("Finished!")
# allowed references for cocitation analysis: 1000
Examples: ['Bell, D.|coming post industri', 'Rosenberg, M.|society adolescent s', 'Lincoln, E.|black church african']
File 0/386: G:\My Drive\projects\qualitative analysis of literature\pre 5-12-2020\009 get everything from WOS\74501-75000.txt
10000 cocitations logged
20000 cocitations logged
30000 cocitations logged
40000 cocitations logged
50000 cocitations logged
60000 cocitations logged
70000 cocitations logged
80000 cocitations logged
90000 cocitations logged
100000 cocitations logged
110000 cocitations logged
120000 cocitations logged
File 50/386: G:\My Drive\projects\qualitative analysis of literature\pre 5-12-2020\009 get everything from WOS\3501-4000.txt
130000 cocitations logged
140000 cocitations logged
150000 cocitations logged
160000 cocitations logged
170000 cocitations logged
180000 cocitations logged
190000 cocitations logged
200000 cocitations logged
210000 cocitations logged
File 100/386: G:\My Drive\projects\qualitative analysis of literature\pre 5-12-2020\009 get everything from WOS\28501-29000.txt
220000 cocitations logged
230000 cocitations logged
240000 cocitations logged
250000 cocitations logged
260000 cocitations logged
270000 cocitations logged
280000 cocitations logged
290000 cocitations logged
300000 cocitations logged
310000 cocitations logged
320000 cocitations logged
330000 cocitations logged
File 150/386: G:\My Drive\projects\qualitative analysis of literature\pre 5-12-2020\009 get everything from WOS\53501-54000.txt
340000 cocitations logged
350000 cocitations logged
360000 cocitations logged
370000 cocitations logged
380000 cocitations logged
390000 cocitations logged
400000 cocitations logged
410000 cocitations logged
420000 cocitations logged
430000 cocitations logged
440000 cocitations logged
450000 cocitations logged
460000 cocitations logged
File 200/386: G:\My Drive\projects\qualitative analysis of literature\pre 5-12-2020\009 get everything from WOS\2006 and back\4501-5000.txt
470000 cocitations logged
480000 cocitations logged
490000 cocitations logged
500000 cocitations logged
510000 cocitations logged
520000 cocitations logged
530000 cocitations logged
540000 cocitations logged
550000 cocitations logged
560000 cocitations logged
570000 cocitations logged
580000 cocitations logged
590000 cocitations logged
600000 cocitations logged
610000 cocitations logged
620000 cocitations logged
File 250/386: G:\My Drive\projects\qualitative analysis of literature\pre 5-12-2020\009 get everything from WOS\2006 and back\29501-30000.txt
630000 cocitations logged
640000 cocitations logged
650000 cocitations logged
660000 cocitations logged
670000 cocitations logged
680000 cocitations logged
690000 cocitations logged
700000 cocitations logged
710000 cocitations logged
720000 cocitations logged
730000 cocitations logged
740000 cocitations logged
750000 cocitations logged
760000 cocitations logged
File 300/386: G:\My Drive\projects\qualitative analysis of literature\pre 5-12-2020\009 get everything from WOS\2006 and back\54501-55000.txt
770000 cocitations logged
780000 cocitations logged
790000 cocitations logged
800000 cocitations logged
810000 cocitations logged
820000 cocitations logged
830000 cocitations logged
840000 cocitations logged
File 350/386: G:\My Drive\projects\qualitative analysis of literature\pre 5-12-2020\009 get everything from WOS\2006 and back\79501-80000.txt
850000 cocitations logged
Finished!

trim counters, excluding cited works with only a single reference

Because of the huge number of cited works that only occur once, for efficiency it's best to simply log this statistic and eliminate them from the counter.

This might not be necessary

In [16]:
trimCounters = False
In [17]:
if trimCounters:

    onlyOne = len([x for x in cnt_doc['c'] if cnt_doc['c'][x] == 1])
    total = len(cnt_doc['c'])
    print("Of the %s cited works, %s (%0.2f%%) have only a single citation" % (
        total, onlyOne,
        100 * onlyOne/total
    ))

    terms = list(cnt_doc['c'].keys())
    counts = np.array([cnt_doc['c'][k] for k in terms])

    cutoff = 2
    to_remove = set([terms[int(i)] for i in np.argwhere(counts < cutoff)])
    print("consolidating", len(to_remove), list(to_remove)[:5])

    cnt_doc.keys()

    print("old size:", len(cnt_doc['c']))
    for tr in to_remove:
        del cnt_doc['c'][tr]
        del cnt_ind['c'][tr]
    print("new size:", len(cnt_doc['c']))

    print("old size:", len(cnt_doc['cy']))
    tydels = [x for x in cnt_doc['cy'] if x[0] in to_remove]
    for tydel in tydels:
        del cnt_doc['cy'][tydel]
        del cnt_ind['cy'][tydel]
    print("new size:", len(cnt_doc['cy']))

    print("old size:", len(cnt_doc['jc']))
    jtdels = [x for x in cnt_doc['jc'] if x[1] in to_remove]
    for jtdel in jtdels:
        del cnt_doc['jc'][jtdel]
        del cnt_ind['jc'][jtdel]
    print("new size:", len(cnt_doc['jc']))
In [18]:
print(len(cnt_doc['c']), 'cited works')
111731 cited works

export generated variables

In [19]:
# retrieve and use the MOST COMMON pub date for each
pubyears = {
    k:max(s.keys(), key=lambda x:multi_year[k][x]) for k,s in multi_year.items()
    if len(s)
}
varname = "%s.pubyears"%database_name
save_variable(varname, pubyears)
print("saved %s" % varname)
saved sociology-wos.pubyears
In [20]:
save_cnt("%s/doc"%database_name, cnt_doc)
Saving sociology-wos.doc ___ fj
Saving sociology-wos.doc ___ fy
Saving sociology-wos.doc ___ ty
Saving sociology-wos.doc ___ c.fy
Saving sociology-wos.doc ___ c.fj
Saving sociology-wos.doc ___ fj.fy
Saving sociology-wos.doc ___ c
Saving sociology-wos.doc ___ fy.ty
Saving sociology-wos.doc ___ fj.ty
Saving sociology-wos.doc ___ ta
Saving sociology-wos.doc ___ fy.ta
Saving sociology-wos.doc ___ fj.ta
Saving sociology-wos.doc ___ ffa
Saving sociology-wos.doc ___ ffa.fy
Saving sociology-wos.doc ___ ffa.fj
Saving sociology-wos.doc ___ ffa.c
Saving sociology-wos.doc ___ fa
Saving sociology-wos.doc ___ fa.fy
Saving sociology-wos.doc ___ fa.fj
Saving sociology-wos.doc ___ fa.c
Saving sociology-wos.doc ___ c.fy.j
Saving sociology-wos.doc ___ c.c
Saving sociology-wos.doc ___ ty.ty
In [21]:
save_cnt("%s/ind"%database_name, cnt_ind)
Saving sociology-wos.ind ___ fj
Saving sociology-wos.ind ___ fy
Saving sociology-wos.ind ___ ty
Saving sociology-wos.ind ___ c.fy
Saving sociology-wos.ind ___ c.fj
Saving sociology-wos.ind ___ fj.fy
Saving sociology-wos.ind ___ c
Saving sociology-wos.ind ___ fy.ty
Saving sociology-wos.ind ___ fj.ty
Saving sociology-wos.ind ___ ta
Saving sociology-wos.ind ___ fy.ta
Saving sociology-wos.ind ___ fj.ta
Saving sociology-wos.ind ___ ffa
Saving sociology-wos.ind ___ ffa.fy
Saving sociology-wos.ind ___ ffa.fj
Saving sociology-wos.ind ___ ffa.c
Saving sociology-wos.ind ___ fa
Saving sociology-wos.ind ___ fa.fy
Saving sociology-wos.ind ___ fa.fj
Saving sociology-wos.ind ___ fa.c
Saving sociology-wos.ind ___ c.fy.j
Saving sociology-wos.ind ___ c.c
Saving sociology-wos.ind ___ ty.ty