{% extends "_base.html" %} {% block header_javascripts %} {% endblock %} {% block content %}

What is the Interactive Clustering methodology?

Interactive Clustering is a method intended to assist in the design of a training data set. The main objective is to create a labelled dataset without a prior, subjective definition of the represented themes. It is based on an active learning methodology, where the computer suggests a data partitioning and the expert iteratively corrects the computer's decisions.

This iterative process begins with an unlabeled dataset and alternates two substeps:

  1. the user defines constraints on data sampled by the computer;
  2. the computer performs data partitioning using a constrained clustering algorithm.
Thus, at each step of the process:
  • the user corrects the clustering of the previous step using constraints, and
  • the computer offers a corrected and more relevant data partitioning for the next step.
At the end of the process, we obtain a relevant data partitioning that we can use to train a classification model.
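
The loop described above can be sketched as a simple function. This is a minimal sketch, not the actual implementation: the three callables (`sample_pairs`, `annotate`, `cluster`) are hypothetical names standing for the sampling, expert annotation, and constrained clustering components.

```python
def interactive_clustering(texts, sample_pairs, annotate, cluster, n_iterations=3):
    """Alternate expert annotation and constrained clustering (sketch).

    sample_pairs(texts, partition, constraints) -> list of (i, j) index pairs
    annotate(text_i, text_j) -> "MUST_LINK" or "CANNOT_LINK"
    cluster(texts, constraints) -> list of cluster ids, one per text
    """
    constraints, partition = [], None
    for _ in range(n_iterations):
        # 1. the computer samples pairs and the expert annotates them
        for i, j in sample_pairs(texts, partition, constraints):
            constraints.append((i, j, annotate(texts[i], texts[j])))
        # 2. the computer re-partitions the data under the constraints
        partition = cluster(texts, constraints)
    return partition, constraints
```

The final partition then serves as the labelled dataset for training a classifier.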

Schema of the Interactive Clustering process.

Using this method avoids the following tedious tasks:

  • No more abstract intent definition: Annotating with labels requires defining the list of possible labels beforehand. The choice of labels and their definitions is most often based on the subjective view of the annotators, and this task is abstract and can lead to misunderstandings or ambiguities. Here, intents are discovered during clustering computation and correction.
  • No more complex label annotation: Choosing an intent for a data point is an abstract task when the number of labels is large, and careless errors are common. Here, describing data similarity with a constraint is more intuitive, because it amounts to comparing the similarity of the expected actions (i.e. do you answer these two questions the same way?).
However, there are some points of attention during the process:
  • Convergence of the iterations can be optimized with various strategies. Constraints sampling and constrained clustering are the two main steps to configure carefully: a bad clustering algorithm will give an irrelevant result, and a badly tuned sampling algorithm will not correct the clustering execution. Settings should therefore be chosen wisely.
  • Conflicts can occur after mistakes. The computer can face inconsistencies in the constraints, leading to contradictions during execution. A constraint review step is needed to resolve these conflicts.

For more details, see the Frequently Asked Questions and read the articles in the References section.

Do you want to start the experiment?


Frequently Asked Questions

What is a clustering algorithm?

It is an unsupervised algorithm that groups data by similarity. In NLP, it can rely on common linguistic patterns, lexical or syntactic similarities, word vector distances, etc.

Example of clustering with three topics.

The main advantage of such algorithms is the ability to explore data in order to find topics. However, experts often consider the raw results to be of low value (it is hard to distinguish ambiguous formulations, to deal with unbalanced topics, etc.). Thus, to obtain semantically relevant results, manual corrections are sometimes necessary.
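
As an illustration, a minimal unsupervised text clustering can be built with TF-IDF vectors and k-means (this sketch uses scikit-learn; the sample sentences are invented for the example):

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

texts = [
    "How do I reset my password?",
    "I forgot my password, please reset it",
    "What time does the shop open?",
    "What time do you open on Sunday?",
]

# Vectorize the sentences, then group them into two clusters.
vectors = TfidfVectorizer().fit_transform(texts)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(vectors)
```

Here the two "password" questions and the two "opening time" questions share enough vocabulary to end up in separate clusters, but on real data such clean separations are rare, hence the need for expert corrections.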

What is a constraint on data?

It is information given by the expert about data similarity. We deal with two types of constraints:

  • MUST LINK: the data are similar (i.e. same cluster / intent / answer / action / ...);
  • CANNOT LINK: the data are not similar (i.e. different clusters / intents / answers / actions / ...).

Example of constraints (MUST LINK in green, CANNOT LINK in red).

They can be used in constrained clustering to guide the computer's operation.

Example of unconstrained clustering. Example of constrained clustering.
Comparison between unconstrained and constrained clustering.
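
Concretely, a partition can be checked against the annotated constraints: constrained clustering algorithms reject or penalize assignments that break them. A minimal sketch (the list-of-triples representation and the constraint names are illustrative, not this page's actual data model):

```python
def violations(partition, constraints):
    """Return the constraints broken by a partition (one cluster id per item)."""
    broken = []
    for i, j, kind in constraints:
        same = partition[i] == partition[j]
        if (kind == "MUST_LINK" and not same) or (kind == "CANNOT_LINK" and same):
            broken.append((i, j, kind))
    return broken

constraints = [(0, 1, "MUST_LINK"), (0, 2, "CANNOT_LINK")]
print(violations([0, 0, 1], constraints))  # [] : this partition satisfies both
print(violations([0, 1, 0], constraints))  # both constraints are broken
```
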

Why use sampling to choose the data to annotate?

The sampling step determines which constraints should be annotated in order to most effectively correct the clustering. Sampling strategies can be based on:

  • random selection: it provides a baseline;
  • previous clustering results: it allows influencing the behavior of the algorithm;
  • distance between questions: it allows influencing the representation of the data;
  • a combination of the previous strategies, such as:
    • closest neighbors in different clusters, to check whether cluster borders are misplaced;
    • farthest neighbors in the same cluster, to check whether a cluster is in fact an agglomerate of several distinct themes.
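
The last two combined strategies can be sketched with plain distance computations; the 2-D points and the `ranked_pairs` helper below are invented for the example:

```python
import itertools
import math

def ranked_pairs(points, partition, same_cluster, reverse):
    """Rank index pairs by distance, restricted to pairs inside one
    cluster (same_cluster=True) or across two clusters (False)."""
    pairs = [
        (math.dist(points[i], points[j]), i, j)
        for i, j in itertools.combinations(range(len(points)), 2)
        if (partition[i] == partition[j]) == same_cluster
    ]
    return sorted(pairs, reverse=reverse)

points = [(0.0, 0.0), (0.1, 0.0), (1.0, 0.0), (1.2, 0.0)]
partition = [0, 0, 1, 1]

# Closest pair across two different clusters: is the border misplaced?
_, i, j = ranked_pairs(points, partition, same_cluster=False, reverse=False)[0]
# Farthest pair inside the same cluster: does it hide two distinct themes?
_, a, b = ranked_pairs(points, partition, same_cluster=True, reverse=True)[0]
```

Annotating these pairs first gives the constrained clustering the most corrective information per question asked to the expert.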

Example of constraints sampling.
Why is the methodology iterative?
TODO
What should I do if data is ambiguous?
TODO
What is the constraints ground truth?
TODO
What is a constraint inference?
TODO
Example of an inferred MUST LINK constraint. Example of an inferred CANNOT LINK constraint.
Example of constraint inference.
On the left side: (a) MUST LINK + (b) MUST LINK implies (c) MUST LINK;
On the right side: (a) MUST LINK + (b) CANNOT LINK implies (c) CANNOT LINK.
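
These two inference rules can be sketched with a union-find structure: MUST LINK is transitive, and a CANNOT LINK between two groups propagates to all of their members. The `MustLinkGroups` class is an illustrative sketch, not this application's actual code:

```python
class MustLinkGroups:
    """Union-find over item indices; must-linked items share a root."""
    def __init__(self, n):
        self.parent = list(range(n))
    def find(self, x):
        while self.parent[x] != x:
            self.parent[x] = x = self.parent[self.parent[x]]  # path halving
        return x
    def add_must_link(self, i, j):
        self.parent[self.find(i)] = self.find(j)

# Left side: (a) MUST LINK(0, 1) + (b) MUST LINK(1, 2) => (c) MUST LINK(0, 2).
groups = MustLinkGroups(3)
groups.add_must_link(0, 1)
groups.add_must_link(1, 2)
print(groups.find(0) == groups.find(2))  # True: (c) is inferred

# Right side: (a) MUST LINK(0, 1) + (b) CANNOT LINK(1, 2) => (c) CANNOT LINK(0, 2).
g2 = MustLinkGroups(3)
g2.add_must_link(0, 1)
cannot_links = [(1, 2)]
inferred_cl = any(
    {g2.find(a), g2.find(b)} == {g2.find(0), g2.find(2)}
    for a, b in cannot_links
)
print(inferred_cl)  # True: (c) is inferred
```
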
What is a conflict?
TODO
Example of constraint conflicts.
In fact: (a) MUST LINK + (b) MUST LINK implies (c) MUST LINK, but the annotation is CANNOT LINK: there is an error somewhere.
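
Such conflicts can be detected by comparing each annotated CANNOT LINK against the chains of MUST LINK constraints; a minimal sketch (the helper name and constraint representation are illustrative):

```python
def find_conflicts(n, constraints):
    """Return CANNOT LINK constraints contradicted by MUST LINK chains."""
    parent = list(range(n))
    def find(x):
        while parent[x] != x:
            parent[x] = x = parent[parent[x]]  # path halving
        return x
    # Merge all items connected by MUST LINK constraints.
    for i, j, kind in constraints:
        if kind == "MUST_LINK":
            parent[find(i)] = find(j)
    # A CANNOT LINK inside one merged group is a conflict.
    return [(i, j) for i, j, kind in constraints
            if kind == "CANNOT_LINK" and find(i) == find(j)]

constraints = [
    (0, 1, "MUST_LINK"),    # (a)
    (1, 2, "MUST_LINK"),    # (b) -> 0 and 2 are implicitly MUST LINK
    (0, 2, "CANNOT_LINK"),  # annotated: conflict!
]
print(find_conflicts(3, constraints))  # [(0, 2)]
```

Each reported pair must then be reviewed by the expert to find which annotation is the mistake.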
What is the constraints completeness?
TODO
When is the job done?
TODO

User documentation

    TODO

References

{% endblock %}