Essential Sashimi etiquette

The domain-topic model

Let's call a group of documents a "domain" and, correspondingly, a group of terms (words) a "topic". Documents contain terms, and the idea of a domain-topic model is to group documents that tend to share the same terms, and group terms that tend to appear in the same documents. These groups are chosen to be exclusive: each document will be placed in a single domain, and each term in a single topic. A domain will thus contain documents that share a preference for certain topics, just as a topic will contain terms that are preferentially employed in certain domains.

We may also refer to these groups, in general, as "blocks". So domains are blocks whose elements are documents, and topics are blocks whose elements are terms. We call these level-1 blocks, and grouping blocks themselves produces higher levels: level-2 blocks contain level-1 blocks, level-3 blocks contain level-2 blocks, and so forth. At the highest level, all blocks coalesce into a single block, which for documents/domains corresponds to the full corpus and, for terms/topics, to the corpus' vocabulary. Finally, a block that contains another is called its super-block, and the block that is contained is called a sub-block.
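As a minimal sketch of this hierarchy, assume each block simply records which block contains it. The `parent` mapping and the helper names below are hypothetical and only illustrate the super-block/sub-block vocabulary; they are not how Sashimi stores its blocks. The labels assume the "level, type, number" convention used on the map, with D standing for domain.

```python
# Hypothetical representation: each block label maps to its super-block's label.
parent = {
    "L1D1": "L2D1",
    "L1D2": "L2D1",
    "L1D3": "L2D2",
    "L2D1": "L3D1",
    "L2D2": "L3D1",
    "L3D1": None,  # the single top block: the full corpus
}


def ancestors(block):
    """Return the chain of super-blocks from `block` up to the top block."""
    chain = []
    current = parent[block]
    while current is not None:
        chain.append(current)
        current = parent[current]
    return chain


def sub_blocks(block):
    """Return the blocks directly contained in `block`, sorted by label."""
    return sorted(b for b, p in parent.items() if p == block)
```

Here `ancestors("L1D3")` walks leftwards on the map, from a level-1 domain to the corpus, while `sub_blocks("L2D1")` lists what would sit to the right of a level-2 domain.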

A la carte

The map is a hierarchical block map, where each column stacks the blocks of a given level. Moving horizontally changes the level, with a block's super-block located to its left, and its sub-blocks to its right. To help distinguish the different types of blocks, domains are consistently presented in salmon, while topics employ the color tuna. Wasabi is used for other types of blocks, which we'll see later.

Features of blocks on the map:

  1. Color intensity conveys values for the salience of each block according to some criteria.
  2. Shown on the bottom-left is the original value, before normalization (useful for cross-level comparison).
  3. On the bottom-middle:
  4. On the bottom-right, a label indicates the block's level, type, and numbering.

In the starting maps, values correspond to the quantities shown in the bottom-middle relative to their totals: for a domain, the fraction of all documents it contains; for a topic, the fraction of total term usage it accounts for. As we'll see, clicking on blocks on the opposite map will change this.
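To make "relative to their totals" concrete, here is a small sketch with invented counts. The dictionaries and the `fractions` helper are hypothetical, chosen only to illustrate the arithmetic.

```python
# Hypothetical counts: documents per domain, and term occurrences per topic.
docs_per_domain = {"L1D1": 50, "L1D2": 30, "L1D3": 20}
usage_per_topic = {"L1T1": 600, "L1T2": 400}


def fractions(counts):
    """Divide each count by the total, so the resulting values sum to 1."""
    total = sum(counts.values())
    return {block: count / total for block, count in counts.items()}


domain_values = fractions(docs_per_domain)  # share of all documents
topic_values = fractions(usage_per_topic)   # share of total term usage
```

With these counts, the domain holding 50 of the 100 documents gets a value of one half, and the topic accounting for 400 of the 1000 term occurrences gets four tenths.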

Starting from the top-left, each block contains text that qualifies the elements it contains. The form and meaning of this content will vary depending on the type of block.

Block contents for domains:

Block contents for topics/other:

  1. frequent elements: top terms from the topic, ordered by highest frequency in the corpus.

My cursor, maihashi

Scroll down and up to zoom in and out of blocks. To help navigate the hierarchy, zooming in on a block stops when the block fills the frame. Move the cursor lower in the hierarchy (to the right) to continue zooming. Zooming on a block that is partially hidden immediately brings it entirely into the frame.

Click on a block to select and learn more about it: on the opposite side, the Map and the Info tab will be updated with information on the selected block. Shift-click allows selecting multiple blocks, and Ctrl-Shift-click allows deselecting individual blocks. You may deselect all selected blocks in a map with a Ctrl-click elsewhere on the map, or deselect everything in all maps by pressing Escape on the keyboard.

Upon selecting a domain

The color and values of the Topic map will be updated to show, for each topic, the logarithm of the ratio between how much the topic is used in the selected domain(s) and how much it gets used overall. This quantity represents how much a topic is over-represented (positive) or under-represented (negative) in the selected domain, with shades of grey standing for negative values.
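The following sketch shows one way to read this quantity. It assumes that "how much a topic is used" means the topic's share of term usage, and it uses the natural logarithm; both are assumptions, as are the counts and the `log_ratio` name, since the normalization and the base of the logarithm are not specified here.

```python
import math

# Hypothetical term-usage counts per topic.
usage_in_domain = {"L1T1": 30, "L1T2": 10}   # within the selected domain
usage_overall = {"L1T1": 500, "L1T2": 500}   # across the whole corpus


def log_ratio(topic):
    """Log of (topic's share in the domain) / (topic's share overall)."""
    share_in_domain = usage_in_domain[topic] / sum(usage_in_domain.values())
    share_overall = usage_overall[topic] / sum(usage_overall.values())
    return math.log(share_in_domain / share_overall)


scores = {topic: log_ratio(topic) for topic in usage_overall}
```

In this example the first topic makes up three quarters of the domain's usage but only half of the corpus' usage, so its score is positive; the second topic is used less in the domain than overall, so its score is negative and would be drawn in grey. A topic with no usage at all in the domain would need special handling, since the logarithm of zero is undefined.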

The Domain info tab

Two line plots may appear on top of the tab. The first represents the distribution of documents along some dimension, usually time. It concerns the documents belonging to the (first chosen) domain. The red line corresponds to the fraction of documents at that point, and the grey line to the absolute number of documents. When a domain is selected, the plot gets restricted to the documents it contains. The second shows multiple colored lines, one for each characteristic topic, representing the distribution along the same dimension of documents using that topic.

Furthermore, the tab displays the full information partially seen on top of the domain block(s), plus a list of titles from documents in the domain, with links to those documents' URLs if these were provided when building the map.

When selecting multiple domains, the Topic map's color and value will depend on the choice presented in the selector (below it and to the right of the Search bar):

Upon selecting a topic

The color and values of the Domain map will be updated to show, for each domain, the logarithm of the ratio between how much the selected topic is used in that domain and how much it gets used overall. This quantity represents how much a domain over-represents (positive) or under-represents (negative) the selected topic, with shades of grey standing for negative values.

The Topic info tab, in addition to the full information seen on top of the topic block(s), will provide a list of titles from documents that contain any terms from the topic. At level 1, frequent terms displays the full list of terms in the topic.

When selecting multiple topics, the Domain map's color and value behave in a manner analogous to the Topic map's when multiple domains are selected.

Searching

You'll find a Search bar on top of the maps. It recognizes partial matches, and allows searching blocks by label (e.g. "L2D3", "L4T2"), by term (e.g. "green", "yellow"), which will select the containing topic, and by title (e.g. "A study of wonderful things"), which will select the containing domain. Use it wisely.
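As an illustration of how such labels can be read, the sketch below splits a label into level, type, and number, and does a simple partial match over a list of labels. The pattern, the function names, and the case-insensitive matching are assumptions made for the example, not a description of how the Search bar is implemented.

```python
import re

# Assumed label shape: "L" + level, one type letter (D for domain,
# T for topic, other letters for other block types), then a number.
LABEL = re.compile(r"^L(\d+)([A-Z])(\d+)$")


def parse_label(label):
    """Return (level, type, number), or None if `label` is not a block label."""
    match = LABEL.match(label)
    if match is None:
        return None
    level, kind, number = match.groups()
    return int(level), kind, int(number)


def search(query, labels):
    """Return the labels containing `query`, ignoring case, in input order."""
    needle = query.lower()
    return [label for label in labels if needle in label.lower()]
```

For instance, `parse_label("L2D3")` reads as level 2, domain, number 3, while a plain term such as "green" is not a label and yields `None`.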

Au revoir, またいつか

Bibliography (also to credit where appropriate)

Hannud Abdo, A., Cointet, J.-P., Bourret, P., & Cambrosio, A. (2022). Domain-topic models with chained dimensions: Charting an emergent domain of a major oncology conference. Journal of the Association for Information Science and Technology, 73(7), 992–1011. https://doi.org/10.1002/asi.24606