Overview

Dataset statistics

Number of variables: 12
Number of observations: 891
Missing cells: 866
Missing cells (%): 8.1%
Duplicate rows: 0
Duplicate rows (%): 0.0%
Total size in memory: 83.7 KiB
Average record size in memory: 96.1 B
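
The missing-cell percentage can be re-derived from the counts in this report: total cells are variables times observations, and the missing cells are the sum of the per-variable missing counts (Age 177, Cabin 687, Embarked 2). A minimal Python check; the variable names are illustrative and not part of the report:

```python
# Counts taken from the report itself.
n_variables = 12
n_observations = 891
missing_by_variable = {"Age": 177, "Cabin": 687, "Embarked": 2}

total_cells = n_variables * n_observations          # 12 * 891
missing_cells = sum(missing_by_variable.values())   # 177 + 687 + 2
missing_pct = 100 * missing_cells / total_cells

print(total_cells, missing_cells, f"{missing_pct:.1f}%")
# prints: 10692 866 8.1%
```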

Variable types

CAT: 6
NUM: 5
BOOL: 1

Warnings

Ticket has a high cardinality: 681 distinct values [High cardinality]
Cabin has a high cardinality: 147 distinct values [High cardinality]
Age has 177 (19.9%) missing values [Missing]
Cabin has 687 (77.1%) missing values [Missing]
Ticket is uniformly distributed [Uniform]
Cabin is uniformly distributed [Uniform]
PassengerId has unique values [Unique]
Name has unique values [Unique]
SibSp has 608 (68.2%) zeros [Zeros]
Parch has 678 (76.1%) zeros [Zeros]
Fare has 15 (1.7%) zeros [Zeros]

Reproduction

Analysis started: 2020-10-29 01:25:51.693406
Analysis finished: 2020-10-29 01:26:10.247406
Duration: 18.55 seconds
Software version: pandas-profiling v2.9.0
Download configuration: config.yaml

Variables

PassengerId
Real number (ℝ≥0)

UNIQUE

Distinct: 891
Distinct (%): 100.0%
Missing: 0
Missing (%): 0.0%
Infinite: 0
Infinite (%): 0.0%
Mean: 446
Minimum: 1
Maximum: 891
Zeros: 0
Zeros (%): 0.0%
Memory size: 7.0 KiB

Quantile statistics

Minimum: 1
5-th percentile: 45.5
Q1: 223.5
Median: 446
Q3: 668.5
95-th percentile: 846.5
Maximum: 891
Range: 890
Interquartile range (IQR): 445

Descriptive statistics

Standard deviation: 257.353842
Coefficient of variation (CV): 0.5770265516
Kurtosis: -1.2
Mean: 446
Median Absolute Deviation (MAD): 223
Skewness: 0
Sum: 397386
Variance: 66231
Monotonicity: Not monotonic
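
Several of these figures can be reproduced with the Python standard library. The sketch below assumes PassengerId holds exactly the integers 1 to 891, which is consistent with the minimum, maximum and distinct count above but is not stated outright in the report. It also assumes the reported variance is the sample variance (n - 1 denominator) and that MAD means the median absolute deviation from the median.

```python
import math
import statistics

# Assumption: PassengerId is exactly the integers 1..891.
ids = list(range(1, 892))

mean = statistics.mean(ids)
variance = statistics.variance(ids)   # sample variance, n - 1 denominator
std = math.sqrt(variance)
cv = std / mean                        # coefficient of variation
median = statistics.median(ids)
mad = statistics.median(abs(x - median) for x in ids)

print(sum(ids), mean, variance, round(std, 6), round(cv, 6), mad)
```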
Histogram with fixed size bins (bins=50)
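Value | Count | Frequency (%)
891 | 1 | 0.1%
293 | 1 | 0.1%
304 | 1 | 0.1%
303 | 1 | 0.1%
302 | 1 | 0.1%
301 | 1 | 0.1%
300 | 1 | 0.1%
299 | 1 | 0.1%
298 | 1 | 0.1%
297 | 1 | 0.1%
Other values (881) | 881 | 98.9%

Value | Count | Frequency (%)
1 | 1 | 0.1%
2 | 1 | 0.1%
3 | 1 | 0.1%
4 | 1 | 0.1%
5 | 1 | 0.1%

Value | Count | Frequency (%)
891 | 1 | 0.1%
890 | 1 | 0.1%
889 | 1 | 0.1%
888 | 1 | 0.1%
887 | 1 | 0.1%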

Survived
Boolean

Distinct: 2
Distinct (%): 0.2%
Missing: 0
Missing (%): 0.0%
Memory size: 7.0 KiB

Value | Count | Frequency (%)
0 | 549 | 61.6%
1 | 342 | 38.4%

Pclass
Categorical

Distinct: 3
Distinct (%): 0.3%
Missing: 0
Missing (%): 0.0%
Memory size: 7.0 KiB

Value | Count | Frequency (%)
3 | 491 | 55.1%
1 | 216 | 24.2%
2 | 184 | 20.7%

Frequencies of value counts

Unique

Unique: 0
Unique (%): 0.0%

Histogram of lengths of the category

Length

Max length: 1
Median length: 1
Mean length: 1
Min length: 1

Name
Categorical

UNIQUE

Distinct: 891
Distinct (%): 100.0%
Missing: 0
Missing (%): 0.0%
Memory size: 7.0 KiB

Value | Count | Frequency (%)
bazzani, miss. albina | 1 | 0.1%
silvey, mrs. william baird (alice munger) | 1 | 0.1%
pernot, mr. rene | 1 | 0.1%
andrew, mr. edgardo samuel | 1 | 0.1%
elias, mr. tannous | 1 | 0.1%
sinkkonen, miss. anna | 1 | 0.1%
nicholson, mr. arthur ernest | 1 | 0.1%
bailey, mr. percy andrew | 1 | 0.1%
rosblom, mr. viktor richard | 1 | 0.1%
leyson, mr. robert william norman | 1 | 0.1%
Other values (881) | 881 | 98.9%

Frequencies of value counts

Unique

Unique: 891
Unique (%): 100.0%

Histogram of lengths of the category

Length

Max length: 82
Median length: 25
Mean length: 26.96520763
Min length: 12

Sex
Categorical

Distinct: 2
Distinct (%): 0.2%
Missing: 0
Missing (%): 0.0%
Memory size: 7.0 KiB

Value | Count | Frequency (%)
male | 577 | 64.8%
female | 314 | 35.2%

Frequencies of value counts

Unique

Unique: 0
Unique (%): 0.0%

Histogram of lengths of the category

Length

Max length: 6
Median length: 4
Mean length: 4.704826038
Min length: 4

Age
Real number (ℝ≥0)

MISSING

Distinct: 88
Distinct (%): 12.3%
Missing: 177
Missing (%): 19.9%
Infinite: 0
Infinite (%): 0.0%
Mean: 29.69911765
Minimum: 0.42
Maximum: 80
Zeros: 0
Zeros (%): 0.0%
Memory size: 7.0 KiB

Quantile statistics

Minimum: 0.42
5-th percentile: 4
Q1: 20.125
Median: 28
Q3: 38
95-th percentile: 56
Maximum: 80
Range: 79.58
Interquartile range (IQR): 17.875

Descriptive statistics

Standard deviation: 14.52649733
Coefficient of variation (CV): 0.4891221855
Kurtosis: 0.1782741536
Mean: 29.69911765
Median Absolute Deviation (MAD): 9
Skewness: 0.3891077823
Sum: 21205.17
Variance: 211.0191247
Monotonicity: Not monotonic

Histogram with fixed size bins (bins=50)

Value | Count | Frequency (%)
24 | 30 | 3.4%
22 | 27 | 3.0%
18 | 26 | 2.9%
19 | 25 | 2.8%
28 | 25 | 2.8%
30 | 25 | 2.8%
21 | 24 | 2.7%
25 | 23 | 2.6%
36 | 22 | 2.5%
29 | 20 | 2.2%
Other values (78) | 467 | 52.4%
(Missing) | 177 | 19.9%

Value | Count | Frequency (%)
0.42 | 1 | 0.1%
0.67 | 1 | 0.1%
0.75 | 2 | 0.2%
0.83 | 2 | 0.2%
0.92 | 1 | 0.1%

Value | Count | Frequency (%)
80 | 1 | 0.1%
74 | 1 | 0.1%
71 | 2 | 0.2%
70.5 | 1 | 0.1%
70 | 2 | 0.2%

SibSp
Real number (ℝ≥0)

ZEROS

Distinct: 7
Distinct (%): 0.8%
Missing: 0
Missing (%): 0.0%
Infinite: 0
Infinite (%): 0.0%
Mean: 0.5230078563
Minimum: 0
Maximum: 8
Zeros: 608
Zeros (%): 68.2%
Memory size: 7.0 KiB

Quantile statistics

Minimum: 0
5-th percentile: 0
Q1: 0
Median: 0
Q3: 1
95-th percentile: 3
Maximum: 8
Range: 8
Interquartile range (IQR): 1

Descriptive statistics

Standard deviation: 1.102743432
Coefficient of variation (CV): 2.108464374
Kurtosis: 17.88041973
Mean: 0.5230078563
Median Absolute Deviation (MAD): 0
Skewness: 3.695351727
Sum: 466
Variance: 1.216043077
Monotonicity: Not monotonic

Histogram with fixed size bins (bins=7)

Value | Count | Frequency (%)
0 | 608 | 68.2%
1 | 209 | 23.5%
2 | 28 | 3.1%
4 | 18 | 2.0%
3 | 16 | 1.8%
8 | 7 | 0.8%
5 | 5 | 0.6%

Value | Count | Frequency (%)
0 | 608 | 68.2%
1 | 209 | 23.5%
2 | 28 | 3.1%
3 | 16 | 1.8%
4 | 18 | 2.0%

Value | Count | Frequency (%)
8 | 7 | 0.8%
5 | 5 | 0.6%
4 | 18 | 2.0%
3 | 16 | 1.8%
2 | 28 | 3.1%

Parch
Real number (ℝ≥0)

ZEROS

Distinct: 7
Distinct (%): 0.8%
Missing: 0
Missing (%): 0.0%
Infinite: 0
Infinite (%): 0.0%
Mean: 0.3815937149
Minimum: 0
Maximum: 6
Zeros: 678
Zeros (%): 76.1%
Memory size: 7.0 KiB

Quantile statistics

Minimum: 0
5-th percentile: 0
Q1: 0
Median: 0
Q3: 0
95-th percentile: 2
Maximum: 6
Range: 6
Interquartile range (IQR): 0

Descriptive statistics

Standard deviation: 0.8060572211
Coefficient of variation (CV): 2.112344071
Kurtosis: 9.778125179
Mean: 0.3815937149
Median Absolute Deviation (MAD): 0
Skewness: 2.749117047
Sum: 340
Variance: 0.6497282437
Monotonicity: Not monotonic

Histogram with fixed size bins (bins=7)

Value | Count | Frequency (%)
0 | 678 | 76.1%
1 | 118 | 13.2%
2 | 80 | 9.0%
5 | 5 | 0.6%
3 | 5 | 0.6%
4 | 4 | 0.4%
6 | 1 | 0.1%

Value | Count | Frequency (%)
0 | 678 | 76.1%
1 | 118 | 13.2%
2 | 80 | 9.0%
3 | 5 | 0.6%
4 | 4 | 0.4%

Value | Count | Frequency (%)
6 | 1 | 0.1%
5 | 5 | 0.6%
4 | 4 | 0.4%
3 | 5 | 0.6%
2 | 80 | 9.0%

Ticket
Categorical

HIGH CARDINALITY
UNIFORM

Distinct: 681
Distinct (%): 76.4%
Missing: 0
Missing (%): 0.0%
Memory size: 7.0 KiB

Value | Count | Frequency (%)
347082 | 7 | 0.8%
1601 | 7 | 0.8%
ca. 2343 | 7 | 0.8%
347088 | 6 | 0.7%
3101295 | 6 | 0.7%
ca 2144 | 6 | 0.7%
382652 | 5 | 0.6%
s.o.c. 14879 | 5 | 0.6%
113760 | 4 | 0.4%
347077 | 4 | 0.4%
Other values (671) | 834 | 93.6%

Frequencies of value counts

Unique

Unique: 547
Unique (%): 61.4%

Histogram of lengths of the category

Length

Max length: 18
Median length: 6
Mean length: 6.750841751
Min length: 3

Fare
Real number (ℝ≥0)

ZEROS

Distinct: 248
Distinct (%): 27.8%
Missing: 0
Missing (%): 0.0%
Infinite: 0
Infinite (%): 0.0%
Mean: 32.20420797
Minimum: 0
Maximum: 512.3292
Zeros: 15
Zeros (%): 1.7%
Memory size: 7.0 KiB

Quantile statistics

Minimum: 0
5-th percentile: 7.225
Q1: 7.9104
Median: 14.4542
Q3: 31
95-th percentile: 112.07915
Maximum: 512.3292
Range: 512.3292
Interquartile range (IQR): 23.0896

Descriptive statistics

Standard deviation: 49.6934286
Coefficient of variation (CV): 1.543072528
Kurtosis: 33.39814088
Mean: 32.20420797
Median Absolute Deviation (MAD): 6.9042
Skewness: 4.78731652
Sum: 28693.9493
Variance: 2469.436846
Monotonicity: Not monotonic

Histogram with fixed size bins (bins=50)

Value | Count | Frequency (%)
8.05 | 43 | 4.8%
13 | 42 | 4.7%
7.8958 | 38 | 4.3%
7.75 | 34 | 3.8%
26 | 31 | 3.5%
10.5 | 24 | 2.7%
7.925 | 18 | 2.0%
7.775 | 16 | 1.8%
26.55 | 15 | 1.7%
0 | 15 | 1.7%
Other values (238) | 615 | 69.0%

Value | Count | Frequency (%)
0 | 15 | 1.7%
4.0125 | 1 | 0.1%
5 | 1 | 0.1%
6.2375 | 1 | 0.1%
6.4375 | 1 | 0.1%

Value | Count | Frequency (%)
512.3292 | 3 | 0.3%
263 | 4 | 0.4%
262.375 | 2 | 0.2%
247.5208 | 2 | 0.2%
227.525 | 4 | 0.4%

Cabin
Categorical

HIGH CARDINALITY
MISSING
UNIFORM

Distinct: 147
Distinct (%): 72.1%
Missing: 687
Missing (%): 77.1%
Memory size: 7.0 KiB

Value | Count | Frequency (%)
b96 b98 | 4 | 0.4%
g6 | 4 | 0.4%
c23 c25 c27 | 4 | 0.4%
f33 | 3 | 0.3%
f2 | 3 | 0.3%
e101 | 3 | 0.3%
c22 c26 | 3 | 0.3%
d | 3 | 0.3%
c92 | 2 | 0.2%
c125 | 2 | 0.2%
Other values (137) | 173 | 19.4%
(Missing) | 687 | 77.1%

Frequencies of value counts

Unique

Unique: 101
Unique (%): 49.5%

Histogram of lengths of the category

Length

Max length: 15
Median length: 3
Mean length: 3.134680135
Min length: 1

Embarked
Categorical

Distinct: 3
Distinct (%): 0.3%
Missing: 2
Missing (%): 0.2%
Memory size: 7.0 KiB

Value | Count | Frequency (%)
s | 644 | 72.3%
c | 168 | 18.9%
q | 77 | 8.6%
(Missing) | 2 | 0.2%

Frequencies of value counts

Unique

Unique: 0
Unique (%): 0.0%

Histogram of lengths of the category

Length

Max length: 3
Median length: 1
Mean length: 1.004489338
Min length: 1

Interactions

Correlations

Pearson's r

Pearson's correlation coefficient (r) is a measure of linear correlation between two variables. Its value lies between -1 and +1, with -1 indicating total negative linear correlation, 0 indicating no linear correlation and +1 indicating total positive linear correlation. Furthermore, r is invariant under separate changes in location and scale of the two variables, which implies that for a perfectly linear relationship the steepness of the line does not affect r.

To calculate r for two variables X and Y, one divides the covariance of X and Y by the product of their standard deviations.
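
As a minimal sketch of this calculation in plain Python (the function name `pearson_r` is illustrative, not the routine the report uses): the normalising factors of the covariance and of the standard deviations cancel, so sums of centred products are enough.

```python
import math

def pearson_r(x, y):
    """Covariance of x and y divided by the product of their standard deviations."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # The 1/(n - 1) factors of covariance and standard deviations cancel out.
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)

x = [1, 2, 3, 4, 5]
y = [2, 1, 4, 3, 5]
r = pearson_r(x, y)
# Shifting and rescaling y (with a positive scale) leaves r unchanged.
r_rescaled = pearson_r(x, [3 * v + 7 for v in y])
```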

Spearman's ρ

Spearman's rank correlation coefficient (ρ) is a measure of monotonic correlation between two variables, and is therefore better at catching nonlinear monotonic correlations than Pearson's r. Its value lies between -1 and +1, with -1 indicating total negative monotonic correlation, 0 indicating no monotonic correlation and +1 indicating total positive monotonic correlation.

To calculate ρ for two variables X and Y, one divides the covariance of the rank variables of X and Y by the product of their standard deviations.
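
A minimal sketch of this definition in plain Python: rank both variables, then apply the covariance-over-standard-deviations formula to the ranks. Assigning tied values the average of their ranks is an assumption here (a common convention), and the helper names are illustrative.

```python
import math

def average_ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks

def spearman_rho(x, y):
    """Pearson correlation computed on the rank variables of x and y."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = math.sqrt(sum((a - mx) ** 2 for a in rx))
    sy = math.sqrt(sum((b - my) ** 2 for b in ry))
    return cov / (sx * sy)

# A nonlinear but strictly increasing relationship still gives rho = 1.
rho = spearman_rho([1, 2, 3, 4, 5], [1, 8, 27, 64, 125])
```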

Kendall's τ

Similarly to Spearman's rank correlation coefficient, the Kendall rank correlation coefficient (τ) measures ordinal association between two variables. Its value lies between -1 and +1, with -1 indicating total negative correlation, 0 indicating no correlation and +1 indicating total positive correlation.

To calculate τ for two variables X and Y, one determines the number of concordant and discordant pairs of observations. τ is the number of concordant pairs minus the number of discordant pairs, divided by the total number of pairs.
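
The pair-counting definition can be sketched directly in Python. This corresponds to the variant usually called tau-a; libraries often compute a tie-adjusted variant (tau-b) instead, so results on data with ties may differ from what the report shows. The function name is illustrative.

```python
from itertools import combinations

def kendall_tau_a(x, y):
    """(concordant - discordant) / total number of pairs."""
    pairs = list(combinations(range(len(x)), 2))
    concordant = discordant = 0
    for i, j in pairs:
        direction = (x[i] - x[j]) * (y[i] - y[j])
        if direction > 0:
            concordant += 1      # both variables move the same way
        elif direction < 0:
            discordant += 1      # the variables move in opposite ways
        # direction == 0 is a tie and counts as neither
    return (concordant - discordant) / len(pairs)

# Six pairs in total: five concordant, one discordant (the swapped 3 and 2).
tau = kendall_tau_a([1, 2, 3, 4], [1, 3, 2, 4])
```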

Phik (φk)

Phik (φk) is a new and practical correlation coefficient that works consistently between categorical, ordinal and interval variables, captures non-linear dependency and reverts to the Pearson correlation coefficient in the case of a bivariate normal input distribution. Extensive documentation is available for Phik.

Cramér's V (φc)

Cramér's V is an association measure for nominal random variables. The coefficient ranges from 0 to 1, with 0 indicating independence and 1 indicating perfect association. The empirical estimators used for Cramér's V have been shown to be biased, even for large samples. We use a bias-corrected measure proposed by Bergsma in 2013.
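
A minimal sketch of the bias-corrected estimator in plain Python, following the correction as it is commonly stated: the chi-squared statistic divided by n is reduced by (k - 1)(r - 1)/(n - 1), and the category counts r and k are shrunk in the same spirit. The function name is illustrative, the inputs are assumed to contain no missing values, and this is not necessarily the exact routine the report uses.

```python
import math
from collections import Counter

def cramers_v_corrected(x, y):
    """Bias-corrected Cramér's V for two equally long sequences of labels."""
    n = len(x)
    joint = Counter(zip(x, y))
    rows, cols = Counter(x), Counter(y)
    r, k = len(rows), len(cols)
    # Pearson chi-squared statistic of the contingency table.
    chi2 = 0.0
    for a in rows:
        for b in cols:
            expected = rows[a] * cols[b] / n
            chi2 += (joint[(a, b)] - expected) ** 2 / expected
    phi2 = chi2 / n
    phi2_corr = max(0.0, phi2 - (k - 1) * (r - 1) / (n - 1))
    r_corr = r - (r - 1) ** 2 / (n - 1)
    k_corr = k - (k - 1) ** 2 / (n - 1)
    denom = min(r_corr - 1, k_corr - 1)
    if denom <= 0:
        return 0.0  # degenerate table, e.g. a constant variable
    return math.sqrt(phi2_corr / denom)

# Perfect association: each label of x maps to exactly one label of y.
v_perfect = cramers_v_corrected(["a"] * 50 + ["b"] * 50, ["u"] * 50 + ["v"] * 50)
# Independence: every combination of labels occurs equally often.
v_independent = cramers_v_corrected(["a", "a", "b", "b"] * 25, ["u", "v", "u", "v"] * 25)
```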

Missing values

Sample

First rows

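(index) | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked
0 | 380 | 0 | 3 | gustafsson, mr. karl gideon | male | 19.0 | 0 | 0 | 347069 | 7.7750 | NaN | s
1 | 232 | 0 | 3 | larsson, mr. bengt edvin | male | 29.0 | 0 | 0 | 347067 | 7.7750 | NaN | s
2 | 525 | 0 | 3 | kassem, mr. fared | male | NaN | 0 | 0 | 2700 | 7.2292 | NaN | c
3 | 831 | 1 | 3 | yasbeck, mrs. antoni (selini alexander) | female | 15.0 | 1 | 0 | 2659 | 14.4542 | NaN | c
4 | 61 | 0 | 3 | sirayanian, mr. orsen | male | 22.0 | 0 | 0 | 2669 | 7.2292 | NaN | c
5 | 820 | 0 | 3 | skoog, master. karl thorsten | male | 10.0 | 3 | 2 | 347088 | 27.9000 | NaN | s
6 | 96 | 0 | 3 | shorney, mr. charles joseph | male | NaN | 0 | 0 | 374910 | 8.0500 | NaN | s
7 | 89 | 1 | 1 | fortune, miss. mabel helen | female | 23.0 | 3 | 2 | 19950 | 263.0000 | c23 c25 c27 | s
8 | 84 | 0 | 1 | carrau, mr. francisco m | male | 28.0 | 0 | 0 | 113059 | 47.1000 | NaN | s
9 | 27 | 0 | 3 | emir, mr. farred chehab | male | NaN | 0 | 0 | 2631 | 7.2250 | NaN | c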

Last rows

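(index) | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked
881 | 867 | 1 | 2 | duran y more, miss. asuncion | female | 27.0 | 1 | 0 | sc/paris 2149 | 13.8583 | NaN | c
882 | 323 | 1 | 2 | slayter, miss. hilda mary | female | 30.0 | 0 | 0 | 234818 | 12.3500 | NaN | q
883 | 108 | 1 | 3 | moss, mr. albert johan | male | NaN | 0 | 0 | 312991 | 7.7750 | NaN | s
884 | 352 | 0 | 1 | williams-lambert, mr. fletcher fellows | male | NaN | 0 | 0 | 113510 | 35.0000 | c128 | s
885 | 538 | 1 | 1 | leroy, miss. bertha | female | 30.0 | 0 | 0 | pc 17761 | 106.4250 | NaN | c
886 | 856 | 1 | 3 | aks, mrs. sam (leah rosen) | female | 18.0 | 0 | 1 | 392091 | 9.3500 | NaN | s
887 | 44 | 1 | 2 | laroche, miss. simonne marie anne andree | female | 3.0 | 1 | 2 | sc/paris 2123 | 41.5792 | NaN | c
888 | 466 | 0 | 3 | goncalves, mr. manuel estanslas | male | 38.0 | 0 | 0 | soton/o.q. 3101306 | 7.0500 | NaN | s
889 | 248 | 1 | 2 | hamalainen, mrs. william (anna) | female | 24.0 | 0 | 2 | 250649 | 14.5000 | NaN | s
890 | 638 | 0 | 2 | collyer, mr. harvey | male | 31.0 | 1 | 1 | c.a. 31921 | 26.2500 | NaN | s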