Visualizer:

Automate the process of visualization.

Introduction:

Visualizer is a Python 3 package that simplifies data visualization, either by automating it, which saves a great deal of time,
or by letting you specify which relationship to visualize or which type of plot to show.

Installation:

pip install -U visualizer

Features:

The Visualizer package supports two types of plotting:

  1. Visualize by an individual column:
    • Count Plot.
    • Pie Plot.
    • Histogram plot.
    • KDE plot.
    • WordCloud plot.
    • Histogram for high cardinality columns.
    • Line plot with index.
    • Point plot with index.
    • Clustered-bar Plot.
    • Bubble plot.
    • Scatter plot.
    • Density plot.
    • Box plot.
    • Violin plot.
    • Ridge plot.
    • Parallel plot.
    • Radar plot.
  2. Visualize by a relationship (multiple columns):
    • Uni-variate Target.
    • Uni-variate Categorical (Cat).
    • Uni-variate Numerical (Num).
    • Bi-variate Num with Index.
    • Bi-variate Cat with Index.
    • Bi-variate Num with Num.
    • Bi-variate Num with Cat.
    • Bi-variate Cat with Cat.
    • Bi-variate Cat with Target.
    • Bi-variate Num with Target.
    • Multi-variate Nums with Cat.

NOTE:

Methods for the first type of plotting start with create_; methods for the second type start with visualize_, as you will see next.
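
The naming convention can be checked programmatically. The following is a minimal sketch that groups a class's public methods by prefix. DemoVisualizer is a hypothetical stand-in that reuses a few method names from this README so the snippet runs without the package installed; declaring the create_ methods as static methods is an assumption.

```python
from collections import defaultdict


def group_methods_by_prefix(cls, prefixes=("create_", "visualize_")):
    """Return {prefix: sorted list of public callable names starting with it}."""
    groups = defaultdict(list)
    for name in dir(cls):
        if name.startswith("_"):
            continue  # skip private and dunder attributes
        if not callable(getattr(cls, name)):
            continue  # skip plain data attributes
        for prefix in prefixes:
            if name.startswith(prefix):
                groups[prefix].append(name)
    return {prefix: sorted(groups[prefix]) for prefix in prefixes}


class DemoVisualizer:
    """Hypothetical stand-in; the real class is visualizer.Visualizer."""

    @staticmethod
    def create_count_plot(df, cat_col):
        pass

    @staticmethod
    def create_pie_plot(df, cat_col):
        pass

    def visualize_all(self):
        pass

    def visualize_target(self):
        pass


groups = group_methods_by_prefix(DemoVisualizer)
print(groups["create_"])
print(groups["visualize_"])
```

With the package installed, passing the real Visualizer class to group_methods_by_prefix should list every available method of each type.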

Usage:

In [7]:
# Import the essential libraries.
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
pd.set_option('display.max_columns', None)
sns.set_style('white')
In [8]:
# Reading the data set to visualize.
df = pd.read_csv("/media/mosaab/Volume/Personal/Development/Courses Docs/ML GrandMaster/ml_project/input/new_df.csv")

# print the shape of the data.
df.shape
Out[8]:
(2001, 18)
In [9]:
df.head()
Out[9]:
bin_3 nom_0 nom_1 nom_5 nom_6 nom_7 nom_8 nom_9 ord_3 ord_5 target SalePrice GarageArea GrLivArea 1stFlrSF 1stFlrSF.1 TotalBsmtSF LotArea
0 T Green Triangle 50f116bcf 3ac1b8814 68f6ad3e9 c389000ab 2f4cb3d51 h kr 0 208500.0 548.0 1710.0 856.0 856.0 856.0 8450.0
1 T Green Trapezoid b3b4d25d0 fbcb50fc1 3b6dd5612 4cd920251 f83c56c21 a bF 0 181500.0 460.0 1262.0 1262.0 1262.0 1262.0 9600.0
2 F Blue Trapezoid 3263bdce5 0922e3cb8 a6a36f527 de9c9f684 ae6800dd0 h Jc 0 223500.0 608.0 1786.0 920.0 920.0 920.0 11250.0
3 F Red Trapezoid f12246592 50d7ad46a ec69236eb 4ade6ab69 8270f0d71 i kW 1 140000.0 642.0 1717.0 961.0 961.0 756.0 9550.0
4 F Red Trapezoid 5b0f5acd5 1fe17a1fd 04ddac2be cb43ab175 b164b72a7 a qP 0 250000.0 836.0 2198.0 1145.0 1145.0 1145.0 14260.0

We can see that the dataframe has:

  • 10 categorical columns.
  • 6 numerical columns.
  • 1 categorical target column (target) for testing a binary classification problem.
  • 1 numerical target column (SalePrice) for testing a regression problem.

Let's import the Visualizer package:

In [18]:
# Import the visualizer.
from visualizer import Visualizer
In [19]:
# instantiate the class.
vis = Visualizer(df=df,
#                num_cols=[],
#                cat_cols=[],
                 target_col='SalePrice',
                 ignore_cols=['ord_5'],
                 problem_type='regression')

# Parameters:
## 1. df:            (pandas.DataFrame) dataframe.                                                      (required)
## 2. target_col:    (str)              target column.                                                  (required)
## 3. problem_type:  (str)              type of problem: "classification" or "regression".              (required)
## 4. num_cols:      (list)             numerical columns.                                              (optional)
## 5. cat_cols:      (list)             categorical columns.                                            (optional)
## 6. ignore_cols:   (list)             columns to exclude from visualization.                          (optional)
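
Before instantiating the class, it can help to check the arguments against the dataframe's column names. The sketch below is a hypothetical helper, not part of the visualizer package; it only encodes the constraints listed in the parameter table above (the target must be a column, the problem type must be one of the two documented values, and any listed columns must exist).

```python
VALID_PROBLEM_TYPES = ("classification", "regression")


def check_visualizer_args(columns, target_col, problem_type,
                          num_cols=None, cat_cols=None, ignore_cols=None):
    """Return a list of human-readable problems; an empty list means OK.

    Hypothetical helper, not part of the visualizer package.
    `columns` is the list of column names, e.g. list(df.columns).
    """
    problems = []
    known = set(columns)
    if target_col not in known:
        problems.append(f"target_col {target_col!r} is not a column")
    if problem_type not in VALID_PROBLEM_TYPES:
        problems.append(f"problem_type must be one of {VALID_PROBLEM_TYPES}")
    for arg_name, values in (("num_cols", num_cols),
                             ("cat_cols", cat_cols),
                             ("ignore_cols", ignore_cols)):
        for col in values or []:
            if col not in known:
                problems.append(f"{arg_name}: {col!r} is not a column")
    return problems


# A subset of the column names from the dataframe shown above.
columns = ["bin_3", "nom_0", "ord_5", "target", "SalePrice", "GarageArea"]
ok = check_visualizer_args(columns, "SalePrice", "regression",
                           ignore_cols=["ord_5"])
bad = check_visualizer_args(columns, "Price", "ranking")
print(ok)
print(bad)
```

An empty list means the arguments are consistent with the column names; otherwise each entry names the offending argument.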

Visualize:

visualize_all():

This method automates visualizing all the relationships between your features, and also lets you choose which kinds of plots to save.
It stores all the generated figures in a folder named "visualizer" inside your current directory.

In [4]:
# Visualize all the relationships of the features.
vis.visualize_all()

# To exclude a specific type of plot, you can do the following:
# vis.visualize_all(use_kde_plot=False)

# This will not create any KDE plots, and you can do the same
# for every other type of plot.
 Uni-variate Cat : 100.0%
 Uni-variate Num : 100.0%
 Bi-variate Num with Index : 100.0%
 Bi-variate Cat with Index : 100.0%
 Bi-variate Cat with Cat : 100.0%
 Bi-variate Num with Num : 100.0%
 Bi-variate Num with Cat : 100.0%
 Bi-variate Cat with Target : 100.0%
 Bi-variate Num with Target : 100.0%
 Multi-variate Nums with Cat : 100.0%
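
To switch off several plot types at once, the keyword arguments can be built as a dictionary and unpacked into the call. This is a minimal sketch: disable_plots is a hypothetical helper, and it only knows the four flag names that appear in this README (use_count_plot, use_pie_plot, use_kde_plot, use_hist_plot). Whether a given method accepts all of them is an assumption to verify against the installed version.

```python
KNOWN_PLOT_FLAGS = ("count", "pie", "kde", "hist")


def disable_plots(*kinds):
    """Build keyword arguments such as {'use_kde_plot': False}."""
    unknown = [kind for kind in kinds if kind not in KNOWN_PLOT_FLAGS]
    if unknown:
        raise ValueError(f"unknown plot kinds: {unknown}")
    return {f"use_{kind}_plot": False for kind in kinds}


kwargs = disable_plots("kde", "pie")
print(kwargs)

# With the package installed, and `vis` being the instance created earlier:
# vis.visualize_all(**kwargs)
```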

After the process is done, your current directory will look like this:

[image.png: screenshot of the generated folder structure]
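
The generated folder can also be inspected from code. The sketch below counts the image files in each sub-folder of "visualizer" using only the standard library. count_figures is a hypothetical helper, and the list of file extensions is an assumption, since this README does not state which format the figures are saved in.

```python
from pathlib import Path


def count_figures(root="visualizer",
                  suffixes=(".png", ".jpg", ".jpeg", ".svg", ".pdf")):
    """Return {relative folder: number of image files directly inside it}."""
    root = Path(root)
    counts = {}
    if not root.is_dir():
        return counts  # nothing has been generated yet
    for path in sorted(root.rglob("*")):
        if path.is_file() and path.suffix.lower() in suffixes:
            folder = path.parent.relative_to(root).as_posix()
            counts[folder] = counts.get(folder, 0) + 1
    return counts


for folder, n in count_figures().items():
    print(f"{folder}: {n} figure(s)")
```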

Uni-variate Plotting:

This type of plotting plots each column individually, whether categorical or numerical, using different types of plots.

NOTE: All the plots will be saved in the visualizer folder.

In [6]:
# Visualizing the target column helps to reveal whether the problem is imbalanced.
# The following parameters are True by default.
vis.visualize_target(use_count_plot=True,
                     use_pie_plot=True,
                     use_kde_plot=True,
                     use_hist_plot=True)

# Visualizing the categorical columns helps us decide how to encode them properly.
vis.visualize_cat()

# Visualizing the numerical columns helps to know whether some columns need to be transformed.
vis.visualize_num()
 Uni-variate Target : 100.0%
 Uni-variate Cat : 100.0%
 Uni-variate Num : 100.0%
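
The imbalance that the target plots reveal can also be expressed as numbers. The following sketch computes the share of each class with the standard library; class_shares is a hypothetical helper, and the input is just the target values of the five rows shown by df.head() above, not the whole dataset.

```python
from collections import Counter


def class_shares(labels):
    """Return {label: fraction of rows}, most frequent label first."""
    counts = Counter(labels)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {label: count / total for label, count in counts.most_common()}


# The `target` values of the five rows shown by df.head() above.
shares = class_shares([0, 0, 0, 1, 0])
print(shares)  # {0: 0.8, 1: 0.2}
```

Since iterating over a pandas Series yields its values, class_shares(df['target']) should work on the full column as well; df['target'].value_counts(normalize=True) gives the same information directly in pandas.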

Bi-variate Plotting:

This type of plotting plots the relationship between 2 columns.

NOTE: All the plots will be saved in the visualizer folder.

In [10]:
# visualize the relationships between categorical columns and numerical columns.
vis.visualize_cat_with_cat()
vis.visualize_num_with_cat()
vis.visualize_num_with_num()

# visualize the relationships between the categorical or numerical columns with the index.
# this is very helpful when you have time-series data.
vis.visualize_cat_with_idx()
vis.visualize_num_with_idx()

# visualize the relationships between the target and the rest of the columns.
# This is very helpful to see the correlation between the target and the other columns,
# so you can build an intuition about which columns are worth keeping and which ones are redundant.
vis.visualize_cat_with_target()
vis.visualize_num_with_target()
 Bi-variate Cat with Cat : 100.0%
 Bi-variate Num with Cat : 100.0%
 Bi-variate Num with Num : 100.0%
 Bi-variate Cat with Index : 100.0%
 Bi-variate Num with Index : 100.0%
 Bi-variate Cat with Target : 100.0%
 Bi-variate Num with Target : 100.0%
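
As a numeric complement to the target plots, the strength of a linear relationship between a numerical column and a numerical target can be summarized with the Pearson correlation coefficient. This is a plain-Python sketch of the standard formula, not something the visualizer package is known to compute; the inputs are only the five rows shown by df.head() above, so the printed value says nothing about the full dataset.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of the same length (>= 2)")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)


# GarageArea and SalePrice for the five rows shown by df.head() above.
garage_area = [548.0, 460.0, 608.0, 642.0, 836.0]
sale_price = [208500.0, 181500.0, 223500.0, 140000.0, 250000.0]
print(round(pearson(garage_area, sale_price), 3))
```

On a full dataframe, pandas offers the same measure through df['GarageArea'].corr(df['SalePrice']).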

Multi-variate Plotting:

This type of plotting plots the relationships between more than 2 columns.

In [11]:
# This type of plotting is very helpful to find a pattern in your dataset.
vis.visualize_multi_variate()
 Multi-variate Nums with Cat : 100.0%

Individual Plotting:

This functionality lets you show any plot available in the visualizer package directly in the notebook.
All the methods for this functionality start with create_, and you can call them without instantiating the class!

Let's see:

Uni-Variate Plotting:

In [9]:
# Uni-variate plotting for categorical columns.
# 1. Count plot
Visualizer.create_count_plot(df=df,                   # ---> dataframe                                               (required)
                             cat_col='bin_3',         # ---> name of categorical column in the mentioned dataframe.  (required)
                             figsize=(6, 4),          # ---> figure size                                             (optional)
                             annot=True,              # ---> show annotation of the percentage [True, False]         (optional)
                             rotate=False)            # ---> Rotate the x labels [True, False]                       (optional)
In [11]:
# 2. Pie plot.
Visualizer.create_pie_plot(df=df,              # ---> dataframe                                               (required)
                           cat_col='nom_1',    # ---> name of categorical column in the mentioned dataframe.  (required)
                           figsize=(6, 6))     # ---> figure size                                             (optional)
In [12]:
# Uni-variate for numerical columns.
# 3. Histogram Plot.
Visualizer.create_hist_plot(df=df,                # ---> dataframe                                             (required)
                            num_col='SalePrice',  # ---> name of numerical column in the mentioned dataframe.  (required)
                            figsize=(8, 6))       # ---> figure size                                           (optional)
In [15]:
# 4. KDE plot.
Visualizer.create_kde_plot(df=df,
                           num_col='GarageArea',
                           target_col=None,
                           figsize=(8, 6))
In [17]:
# 5. WordCloud for a categorical column with high-cardinality labels.
Visualizer.create_wordcloud(df=df,
                            cat_col='ord_5',
                            figsize=(10, 10))
In [20]:
# 6. Histogram for a categorical feature with high-cardinality labels.
Visualizer.create_hist_for_high_cardinality(df=df,
                                            cat_col='ord_5',
                                            annot=True)

# This plot is very helpful for answering the question:
# How many labels occurred n times?
# For example, looking at the far right side, we can say
# that there are 2 labels that occurred 30 times in the whole dataset for that column.
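
The question "how many labels occurred n times?" amounts to counting the counts. The sketch below illustrates that idea with the standard library; it is not necessarily how the package computes the plot. frequency_of_frequencies is a hypothetical helper and the label list is a made-up sample that reuses a few ord_5 values.

```python
from collections import Counter


def frequency_of_frequencies(labels):
    """Return {n: number of distinct labels that occur exactly n times}."""
    occurrences = Counter(labels)               # label -> times it occurs
    histogram = Counter(occurrences.values())   # times -> number of labels
    return dict(sorted(histogram.items()))


# Made-up sample: 'kr' occurs 3 times, 'bF' twice, the rest once.
labels = ["kr", "bF", "Jc", "kW", "qP", "kr", "bF", "kr"]
print(frequency_of_frequencies(labels))  # {1: 3, 2: 1, 3: 1}
```

Read the result as: three labels occur once, one label occurs twice, and one label occurs three times.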

Bi-variate Plotting:

In [22]:
# 7. Line plot along the index.
# This plot is very helpful when we have a time-series dataset.
Visualizer.create_line_with_index(df=df,
                                  num_col='SalePrice',
                                  target_col=None,
                                  figsize=(25, 6))
In [13]:
# 8. Point plot along the index.
Visualizer.create_point_with_index(df=df,
                                   num_col='ord_5',
                                   target_col='target',
                                   figsize=(25, 6))
In [30]:
# 9. Clustered Bar Chart.
# This plot is very helpful to see how 2 categorical features relate to each other.
Visualizer.create_clustered_bar_plot(df=df,
                                     cat_1='nom_0',
                                     cat_2='bin_3',
                                     figsize=(14, 8))
In [32]:
# 10. Bubble Chart.
# Used between 2 categorical features.
Visualizer.create_bubble_plot(df=df,
                              cat_1='bin_3',
                              cat_2='nom_0',
                              figsize=(12, 6))
In [35]:
# 11. Scatter Plot.
# used between 2 numerical columns.
Visualizer.create_scatter_plot(df=df,
                               num_1='SalePrice',
                               num_2='GarageArea',
                               target_col='target',
                               figsize=(12, 6))
In [37]:
# 12. Density Plot.
# used between 2 numerical columns.
Visualizer.create_density_plot(df=df,
                               num_1='SalePrice',
                               num_2='GarageArea',
                               figsize=(8, 6))
In [38]:
# 13. Box Plot.
# Used between a categorical and a numerical column.
Visualizer.create_box_plot(df=df,
                           num_col='SalePrice',
                           cat_col='bin_3',
                           figsize=(8, 6))
In [39]:
# 14. Violin Plot.
# Used between 1 categorical and 1 numerical column.
Visualizer.create_violin_plot(df=df,
                              num_col='SalePrice',
                              cat_col='nom_0')
In [40]:
# 15. Ridge Plot.
# Used between a categorical column and a numerical column.
Visualizer.create_ridge_plot(df=df,
                             num_col='SalePrice',
                             cat_col='nom_0')

Multi-variate Plotting:

In [44]:
# 16. Parallel Plot.
# Used between more than 2 columns.
num_cols = ['GarageArea', 'GrLivArea', '1stFlrSF', 'SalePrice']

Visualizer.create_parallel_plot(df=df,
                                num_cols=num_cols,
                                target_col='target',
                                figsize=(20, 10))
In [45]:
# 17. Radar Plot.
# Used between multiple numerical columns and one categorical column.
Visualizer.create_radar_plot(df=df,
                             num_cols=num_cols,
                             cat_col='target')

Further Development:

  1. Visualize Sparse Columns, to see if they have a pattern.
  2. Visualize NaN/Infinite/Large numeric values across the whole dataframe, to see the pattern of the whole dataframe.
  3. Visualize Text columns.
  4. Add an option to organize the output folders by column, so that each column's folder contains all the relationships involving that column.