"""Principal Component Analysis

Author: josef-pktd
Modified by Kevin Sheppard
"""
import warnings

import numpy as np
import pandas as pd

from statsmodels.tools.sm_exceptions import (ValueWarning,
                                             EstimationWarning)


def _norm(x):
    return np.sqrt(np.sum(x * x))


class PCA(object):
    """
    Principal Component Analysis

    Parameters
    ----------
    data : array_like
        Variables in columns, observations in rows.
    ncomp : int, optional
        Number of components to return. If None, returns as many as the
        smaller of the number of rows or columns in data.
    standardize : bool, optional
        Flag indicating to use standardized data with mean 0 and unit
        variance. standardize being True implies demean. Using standardized
        data is equivalent to computing principal components from the
        correlation matrix of data.
    demean : bool, optional
        Flag indicating whether to demean data before computing principal
        components. demean is ignored if standardize is True. Demeaning data
        but not standardizing is equivalent to computing principal components
        from the covariance matrix of data.
    normalize : bool, optional
        Indicates whether to normalize the factors to have unit inner
        product. If False, the loadings will have unit inner product.
    gls : bool, optional
        Flag indicating to implement a two-step GLS estimator where
        in the first step principal components are used to estimate residuals,
        and then the inverse residual variance is used as a set of weights to
        estimate the final principal components. Setting gls to True requires
        ncomp to be less than the minimum of the number of rows or columns.
    weights : ndarray, optional
        Series weights to use after transforming data according to standardize
        or demean when computing the principal components.
    method : str, optional
        Sets the linear algebra routine used to compute eigenvectors:
        'svd' uses a singular value decomposition (default).
        'eig' uses an eigenvalue decomposition of a quadratic form.
        'nipals' uses the NIPALS algorithm and can be faster than SVD when
        ncomp is small and nvars is large. See notes about additional changes
        when using NIPALS.
    missing : str, optional
        Method for missing data. Choices are:
        'drop-row' - drop rows with missing values.
        'drop-col' - drop columns with missing values.
        'drop-min' - drop either rows or columns, choosing by data retention.
        'fill-em' - use EM algorithm to fill missing values. ncomp should be
        set to the number of factors required.
    tol : float, optional
        Tolerance to use when checking for convergence when using NIPALS.
    max_iter : int, optional
        Maximum iterations when using NIPALS.
    tol_em : float
        Tolerance to use when checking for convergence of the EM algorithm.
    max_em_iter : int
        Maximum iterations for the EM algorithm.

    Attributes
    ----------
    factors : array or DataFrame
        nobs by ncomp array of principal components (scores)
    scores : array or DataFrame
        nobs by ncomp array of principal components - identical to factors
    loadings : array or DataFrame
        ncomp by nvar array of principal component loadings for constructing
        the factors
    coeff : array or DataFrame
        nvar by ncomp array of principal component loadings for constructing
        the projections
    projection : array or DataFrame
        nobs by nvar array containing the projection of the data onto the
        ncomp estimated factors
    rsquare : array or Series
        ncomp array where the element in the ith position is the R-square
        of including the first i principal components. Note: values are
        calculated on the transformed data, not the original data
    ic : array or DataFrame
        ncomp by 3 array containing the Bai and Ng (2002) Information
        criteria. Each column is a different criterion, and each row
        represents the number of included factors.
    eigenvals : array or Series
        nvar array of eigenvalues
    eigenvecs : array or DataFrame
        nvar by nvar array of eigenvectors
    weights : ndarray
        nvar array of weights used to compute the principal components,
        normalized to unit length
    transformed_data : ndarray
        Standardized, demeaned and weighted data used to compute
        principal components and related quantities
    cols : ndarray
        Array of indices indicating columns used in the PCA
    rows : ndarray
        Array of indices indicating rows used in the PCA

    Notes
    -----
    The default options perform principal component analysis on the
    demeaned, unit variance version of data. Setting standardize to False
    will instead only demean, and setting both standardize and
    demean to False will not alter the data.

    Once the data have been transformed, the following relationships hold
    when the number of components (ncomp) is the same as the minimum of the
    number of observations or the number of variables.

    .. math::

        X' X = V \\Lambda V'

    .. math::

        F = X V

    .. math::

        X = F V'

    where X is the `data`, F is the array of principal components (`factors`
    or `scores`), V is the array of eigenvectors (`loadings`), and V' is
    the array of factor coefficients (`coeff`).

    When weights are provided, the principal components are computed from
    the modified data

    .. math::

        \\Omega^{-\\frac{1}{2}} X

    where :math:`\\Omega` is a diagonal matrix composed of the weights. For
    example, when using the GLS version of PCA, the elements of
    :math:`\\Omega` will be the inverse of the variances of the residuals
    from

    .. math::

        X - F V'

    where the number of factors is less than the rank of X.

    References
    ----------
    .. [*] J. Bai and S. Ng, "Determining the number of factors in
       approximate factor models," Econometrica, vol. 70, number 1,
       pp. 191-221, 2002

    Examples
    --------
    Basic PCA using the correlation matrix of the data

    >>> import numpy as np
    >>> from statsmodels.multivariate.pca import PCA
    >>> x = np.random.randn(100)[:, None]
    >>> x = x + np.random.randn(100, 100)
    >>> pc = PCA(x)

    Note that the principal components are computed using an SVD and so the
    correlation matrix is never constructed, unless method='eig'.

    PCA using the covariance matrix of the data

    >>> pc = PCA(x, standardize=False)

    Limiting the number of factors returned to 1, computed using NIPALS

    >>> pc = PCA(x, ncomp=1, method='nipals')
    >>> pc.factors.shape
    (100, 1)

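    When ncomp equals the minimum of the number of rows and columns, the
    projection recovers the original data, illustrating the relationships
    in the Notes section

    >>> np.allclose(PCA(x).project(), x)
    True

    A sketch of the two-step GLS estimator; ncomp must be less than the
    number of columns

    >>> pc = PCA(x, ncomp=10, gls=True)

    An illustrative sketch of EM-based filling of missing values, where
    x_nan is a hypothetical copy of x with some entries set to NaN

    >>> x_nan = x.copy()
    >>> x_nan[::11, ::7] = np.nan
    >>> pc = PCA(x_nan, ncomp=10, missing='fill-em')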
    """

    def __init__(self, data, ncomp=None, standardize=True, demean=True,
                 normalize=True, gls=False, weights=None, method='svd',
                 missing=None, tol=5e-8, max_iter=1000, tol_em=5e-8,
                 max_em_iter=100):
        self._index = None
        self._columns = []
        if isinstance(data, pd.DataFrame):
            self._index = data.index
            self._columns = data.columns

        self.data = np.asarray(data)
        # Store inputs
        self._gls = gls
        self._normalize = normalize
        self._tol = tol
        if not 0 < self._tol < 1:
            raise ValueError('tol must be strictly between 0 and 1')
        self._max_iter = max_iter
        self._max_em_iter = max_em_iter
        self._tol_em = tol_em

        # Prepare data
        self._standardize = standardize
        self._demean = demean

        self._nobs, self._nvar = self.data.shape
        if weights is None:
            weights = np.ones(self._nvar)
        else:
            weights = np.array(weights).flatten()
            if weights.shape[0] != self._nvar:
                raise ValueError('weights should have nvar elements')
            weights = weights / np.sqrt((weights ** 2.0).mean())
        self.weights = weights

        # Check ncomp against maximum
        min_dim = min(self._nobs, self._nvar)
        self._ncomp = min_dim if ncomp is None else ncomp
        if self._ncomp > min_dim:
            warn = 'The requested number of components is more than can be ' \
                   'computed from data. The maximum number of components is ' \
                   'the minimum of the number of observations or variables'
            warnings.warn(warn, ValueWarning)
            self._ncomp = min_dim

        self._method = method
        # Workaround to avoid instance methods in __dict__
        if self._method not in ('eig', 'svd', 'nipals'):
            raise ValueError('method {0} is not known.'.format(method))

        self.rows = np.arange(self._nobs)
        self.cols = np.arange(self._nvar)
        # Handle missing
        self._missing = missing
        self._adjusted_data = self.data
        if missing is not None:
            self._adjust_missing()
        # Update size
        self._nobs, self._nvar = self._adjusted_data.shape
        if self._ncomp == np.min(self.data.shape):
            self._ncomp = np.min(self._adjusted_data.shape)
        elif self._ncomp > np.min(self._adjusted_data.shape):
            raise ValueError('When adjusting for missing values, user '
                             'provided ncomp must be no larger than the '
                             'smallest dimension of the '
                             'missing-value-adjusted data size.')

        # Attributes and internal values
        self._tss = 0.0
        self._ess = None
        self.transformed_data = None
        self._mu = None
        self._sigma = None
        self._ess_indiv = None
        self._tss_indiv = None
        self.scores = self.factors = None
        self.loadings = None
        self.coeff = None
        self.eigenvals = None
        self.eigenvecs = None
        self.projection = None
        self.rsquare = None
        self.ic = None

        # Prepare data
        self.transformed_data = self._prepare_data()
        # Perform the PCA
        self._pca()
        if gls:
            self._compute_gls_weights()
            self.transformed_data = self._prepare_data()
            self._pca()

        # Final calculations
        self._compute_rsquare_and_ic()
        if self._index is not None:
            self._to_pandas()

    def _adjust_missing(self):
        """
        Implements alternatives for handling missing values
        """

        def keep_col(x):
            index = np.logical_not(np.any(np.isnan(x), 0))
            return x[:, index], index

        def keep_row(x):
            index = np.logical_not(np.any(np.isnan(x), 1))
            return x[index, :], index

        if self._missing == 'drop-col':
            self._adjusted_data, index = keep_col(self.data)
            self.cols = np.where(index)[0]
            self.weights = self.weights[index]
        elif self._missing == 'drop-row':
            self._adjusted_data, index = keep_row(self.data)
            self.rows = np.where(index)[0]
        elif self._missing == 'drop-min':
            drop_col, drop_col_index = keep_col(self.data)
            drop_col_size = drop_col.size

            drop_row, drop_row_index = keep_row(self.data)
            drop_row_size = drop_row.size

            if drop_row_size > drop_col_size:
                self._adjusted_data = drop_row
                self.rows = np.where(drop_row_index)[0]
            else:
                self._adjusted_data = drop_col
                self.weights = self.weights[drop_col_index]
                self.cols = np.where(drop_col_index)[0]
        elif self._missing == 'fill-em':
            self._adjusted_data = self._fill_missing_em()
        else:
            raise ValueError('missing method is not known.')

        if self._index is not None:
            self._columns = self._columns[self.cols]
            self._index = self._index[self.rows]

        # Check adjusted data size
        if self._adjusted_data.size == 0:
            raise ValueError('Removal of missing values has eliminated '
                             'all data.')

    def _compute_gls_weights(self):
        """
        Computes GLS weights based on percentage of data fit
        """
        errors = self.transformed_data - np.asarray(self.projection)
        if self._ncomp == self._nvar:
            raise ValueError('gls can only be used when ncomp < nvar '
                             'so that residuals have non-zero variance')
        var = (errors ** 2.0).mean(0)
        weights = 1.0 / var
        weights = weights / np.sqrt((weights ** 2.0).mean())
        nvar = self._nvar
        eff_series_perc = (1.0 / sum((weights / weights.sum()) ** 2.0)) / nvar
        if eff_series_perc < 0.1:
            eff_series = int(np.round(eff_series_perc * nvar))
            warn = 'Many series are being down weighted by GLS. Of the ' \
                   '{original} series, the GLS estimates are based on only ' \
                   '{effective} (effective) ' \
                   'series.'.format(original=nvar, effective=eff_series)
            warnings.warn(warn, EstimationWarning)

        self.weights = weights

    def _pca(self):
        """
        Main PCA routine
        """
        self._compute_eig()
        self._compute_pca_from_eig()
        self.projection = self.project()

    def __repr__(self):
        string = self.__str__()
        string = string[:-1]
        string += ', id: ' + hex(id(self)) + ')'
        return string

    def __str__(self):
        string = 'Principal Component Analysis('
        string += 'nobs: ' + str(self._nobs) + ', '
        string += 'nvar: ' + str(self._nvar) + ', '
        if self._standardize:
            kind = 'Standardize (Correlation)'
        elif self._demean:
            kind = 'Demean (Covariance)'
        else:
            kind = 'None'
        string += 'transformation: ' + kind + ', '
        if self._gls:
            string += 'GLS, '
        string += 'normalization: ' + str(self._normalize) + ', '
        string += 'number of components: ' + str(self._ncomp) + ', '
        # Parenthesize the conditional expression; without the parentheses it
        # consumes the concatenation and discards the accumulated string
        string += 'method: ' + ('Eigenvalue' if self._method == 'eig'
                                else 'SVD')
        string += ')'
        return string

    def _prepare_data(self):
        """
        Standardize or demean data.
        """
        adj_data = self._adjusted_data
        if np.all(np.isnan(adj_data)):
            # np.empty(...).fill(np.nan) returns None; np.full returns the
            # NaN-filled array as intended
            return np.full(adj_data.shape[1], np.nan)

        self._mu = np.nanmean(adj_data, axis=0)
        self._sigma = np.sqrt(np.nanmean((adj_data - self._mu) ** 2.0,
                                         axis=0))
        if self._standardize:
            data = (adj_data - self._mu) / self._sigma
        elif self._demean:
            data = (adj_data - self._mu)
        else:
            data = adj_data
        return data / np.sqrt(self.weights)

    def _compute_eig(self):
        """
        Wrapper for actual eigenvalue method

        This is a workaround to avoid instance methods in __dict__
        """
        if self._method == 'eig':
            return self._compute_using_eig()
        elif self._method == 'svd':
            return self._compute_using_svd()
        else:  # self._method == 'nipals'
            return self._compute_using_nipals()

    def _compute_using_svd(self):
        """SVD method to compute eigenvalues and eigenvecs"""
        x = self.transformed_data
        u, s, v = np.linalg.svd(x)
        self.eigenvals = s ** 2.0
        self.eigenvecs = v.T

    def _compute_using_eig(self):
        """
        Eigenvalue decomposition method to compute eigenvalues and
        eigenvectors
        """
        x = self.transformed_data
        self.eigenvals, self.eigenvecs = np.linalg.eigh(x.T.dot(x))

    def _compute_using_nipals(self):
        """
        NIPALS implementation to compute small number of eigenvalues
        and eigenvectors
        """
        x = self.transformed_data
        if self._ncomp > 1:
            x = x + 0.0  # Copy

        tol, max_iter, ncomp = self._tol, self._max_iter, self._ncomp
        vals = np.zeros(self._ncomp)
        vecs = np.zeros((self._nvar, self._ncomp))
        for i in range(ncomp):
            max_var_ind = np.argmax(x.var(0))
            factor = x[:, [max_var_ind]]
            _iter = 0
            diff = 1.0
            while diff > tol and _iter < max_iter:
                vec = x.T.dot(factor) / (factor.T.dot(factor))
                vec = vec / np.sqrt(vec.T.dot(vec))
                factor_last = factor
                factor = x.dot(vec) / (vec.T.dot(vec))
                diff = _norm(factor - factor_last) / _norm(factor)
                _iter += 1
            vals[i] = (factor ** 2).sum()
            vecs[:, [i]] = vec
            if ncomp > 1:
                # Deflate: remove the extracted component before the next pass
                x -= factor.dot(vec.T)

        self.eigenvals = vals
        self.eigenvecs = vecs

    def _fill_missing_em(self):
        """
        EM algorithm to fill missing values
        """
        non_missing = np.logical_not(np.isnan(self.data))

        # If nothing missing, return without altering the data
        if np.all(non_missing):
            return self.data

        # 1. Standardize the data as needed
        data = self.transformed_data = np.asarray(self._prepare_data())

        ncomp = self._ncomp

        # 2. Check that every row and column has enough non-missing values
        col_non_missing = np.sum(non_missing, 1)
        row_non_missing = np.sum(non_missing, 0)
        if np.any(col_non_missing < ncomp) or np.any(row_non_missing < ncomp):
            raise ValueError('Implementation requires that all columns and '
                             'all rows have at least ncomp non-missing values')
        # 3. Get mask
        mask = np.isnan(data)

        # 4. Compute mean
        mu = np.nanmean(data, 0)

        # 5. Replace missing values with the mean
        projection = np.ones((self._nobs, 1)) * mu
        projection_masked = projection[mask]
        data[mask] = projection_masked

        # 6. Iterate: compute eigenvalues, project, and refill until the
        # filled values converge
        diff = 1.0
        _iter = 0
        while diff > self._tol_em and _iter < self._max_em_iter:
            last_projection_masked = projection_masked
            # Set transformed data to compute eigenvalues
            self.transformed_data = data
            # Call correct eig function here
            self._compute_eig()
            # Call function to compute factors and projection
            self._compute_pca_from_eig()
            projection = np.asarray(self.project(transform=False,
                                                 unweight=False))
            projection_masked = projection[mask]
            data[mask] = projection_masked
            delta = last_projection_masked - projection_masked
            diff = _norm(delta) / _norm(projection_masked)
            _iter += 1
        # Must copy to avoid overwriting original data, since values are
        # being replaced
        data = self._adjusted_data + 0.0
        projection = np.asarray(self.project())
        data[mask] = projection[mask]

        return data

    def _compute_pca_from_eig(self):
        """
        Compute relevant statistics after eigenvalues have been computed
        """
        # Ensure sorted largest to smallest
        vals, vecs = self.eigenvals, self.eigenvecs
        indices = np.argsort(vals)
        indices = indices[::-1]
        vals = vals[indices]
        vecs = vecs[:, indices]
        if (vals <= 0).any():
            # Discard and warn
            num_good = vals.shape[0] - (vals <= 0).sum()
            if num_good < self._ncomp:
                warnings.warn('Only {num:d} eigenvalues are positive. '
                              'This is the maximum number of components '
                              'that can be extracted.'.format(num=num_good),
                              EstimationWarning)

                self._ncomp = num_good
                vals[num_good:] = np.finfo(np.float64).tiny
        # Use ncomp for the remaining calculations
        vals = vals[:self._ncomp]
        vecs = vecs[:, :self._ncomp]
        self.eigenvals, self.eigenvecs = vals, vecs
        # Select correct number of components to return
        self.scores = self.factors = self.transformed_data.dot(vecs)
        self.loadings = vecs
        self.coeff = vecs.T
        if self._normalize:
            self.coeff = (self.coeff.T * np.sqrt(vals)).T
            self.factors /= np.sqrt(vals)
            self.scores = self.factors

    def _compute_rsquare_and_ic(self):
        """
        Final statistics to compute
        """
        # TSS and related calculations
        # TODO: This needs careful testing, with and without weights,
        #   gls, standardized and demean
        weights = self.weights
        ss_data = self.transformed_data * np.sqrt(weights)
        self._tss_indiv = np.sum(ss_data ** 2, 0)
        self._tss = np.sum(self._tss_indiv)
        self._ess = np.zeros(self._ncomp + 1)
        self._ess_indiv = np.zeros((self._ncomp + 1, self._nvar))
        for i in range(self._ncomp + 1):
            # Projection in the same space as transformed_data
            projection = self.project(ncomp=i, transform=False,
                                      unweight=False)
            indiv_rss = (projection ** 2).sum(axis=0)
            rss = indiv_rss.sum()
            self._ess[i] = self._tss - rss
            self._ess_indiv[i, :] = self._tss_indiv - indiv_rss
        self.rsquare = 1.0 - self._ess / self._tss
        # Information Criteria
        ess = self._ess
        invalid = ess <= 0  # Prevent taking the log of 0
        if invalid.any():
            last_obs = (np.where(invalid)[0]).min()
            ess = ess[:last_obs]

        log_ess = np.log(ess)
        r = np.arange(ess.shape[0])

        nobs, nvar = self._nobs, self._nvar
        sum_to_prod = (nobs + nvar) / (nobs * nvar)
        min_dim = min(nobs, nvar)
        penalties = np.array([sum_to_prod * np.log(1.0 / sum_to_prod),
                              sum_to_prod * np.log(min_dim),
                              np.log(min_dim) / min_dim])
        penalties = penalties[:, None]
        ic = log_ess + r * penalties
        self.ic = ic.T

    def project(self, ncomp=None, transform=True, unweight=True):
        """
        Project series onto a specific number of factors.

        Parameters
        ----------
        ncomp : int, optional
            Number of components to use. If omitted, all components
            initially computed are used.
        transform : bool, optional
            Flag indicating whether to return the projection in the original
            space of the data (True, default) or in the space of the
            standardized/demeaned data.
        unweight : bool, optional
            Flag indicating whether to undo the effects of the estimation
            weights.

        Returns
        -------
        array_like
            The nobs by nvar array of the projection onto ncomp factors.

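        Examples
        --------
        A minimal sketch on simulated data; the projection here uses only
        the first two factors

        >>> import numpy as np
        >>> from statsmodels.multivariate.pca import PCA
        >>> x = np.random.randn(100, 10)
        >>> pc = PCA(x, ncomp=5)
        >>> projection = pc.project(2)
        >>> projection.shape
        (100, 10)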
        """

        # Projection needs to be scaled/shifted based on inputs
        ncomp = self._ncomp if ncomp is None else ncomp
        if ncomp > self._ncomp:
            raise ValueError('ncomp must be no larger than the number of '
                             'components computed.')
        factors = np.asarray(self.factors)
        coeff = np.asarray(self.coeff)

        projection = factors[:, :ncomp].dot(coeff[:ncomp, :])
        if transform or unweight:
            projection *= np.sqrt(self.weights)
        if transform:
            # Remove the weights, which do not depend on transformation
            if self._standardize:
                projection *= self._sigma
            if self._standardize or self._demean:
                projection += self._mu
        if self._index is not None:
            projection = pd.DataFrame(projection,
                                      columns=self._columns,
                                      index=self._index)
        return projection

    def _to_pandas(self):
        """
        Returns pandas DataFrames for all values
        """
        index = self._index
        # Principal Components
        num_zeros = np.ceil(np.log10(self._ncomp))
        comp_str = 'comp_{0:0' + str(int(num_zeros)) + 'd}'
        cols = [comp_str.format(i) for i in range(self._ncomp)]
        df = pd.DataFrame(self.factors, columns=cols, index=index)
        self.scores = self.factors = df
        # Projections
        df = pd.DataFrame(self.projection,
                          columns=self._columns,
                          index=index)
        self.projection = df
        # Weights
        df = pd.DataFrame(self.coeff, index=cols,
                          columns=self._columns)
        self.coeff = df
        # Loadings
        df = pd.DataFrame(self.loadings,
                          index=self._columns, columns=cols)
        self.loadings = df
        # eigenvals
        self.eigenvals = pd.Series(self.eigenvals)
        self.eigenvals.name = 'eigenvals'
        # eigenvecs
        vec_str = comp_str.replace('comp', 'eigenvec')
        cols = [vec_str.format(i) for i in range(self.eigenvecs.shape[1])]
        self.eigenvecs = pd.DataFrame(self.eigenvecs, columns=cols)
        # R2
        self.rsquare = pd.Series(self.rsquare)
        self.rsquare.index.name = 'ncomp'
        self.rsquare.name = 'rsquare'
        # IC
        self.ic = pd.DataFrame(self.ic, columns=['IC_p1', 'IC_p2', 'IC_p3'])
        self.ic.index.name = 'ncomp'

    def plot_scree(self, ncomp=None, log_scale=True,
                   cumulative=False, ax=None):
        """
        Plot of the ordered eigenvalues

        Parameters
        ----------
        ncomp : int, optional
            Number of components to include in the plot. If None, all
            computed components are included.
        log_scale : bool, optional
            Flag indicating whether to use a log scale for the y-axis
        cumulative : bool, optional
            Flag indicating whether to plot the eigenvalues or cumulative
            eigenvalues
        ax : AxesSubplot, optional
            An axes on which to draw the graph. If omitted, a new figure
            is created

        Returns
        -------
        matplotlib.figure.Figure
            The handle to the figure.

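        Examples
        --------
        A minimal usage sketch (assumes matplotlib is installed)

        >>> import numpy as np
        >>> from statsmodels.multivariate.pca import PCA
        >>> x = np.random.randn(100, 10)
        >>> fig = PCA(x).plot_scree()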
        """

        import statsmodels.graphics.utils as gutils

        fig, ax = gutils.create_mpl_ax(ax)

        ncomp = self._ncomp if ncomp is None else ncomp
        vals = np.asarray(self.eigenvals)
        vals = vals[:self._ncomp]
        if cumulative:
            vals = np.cumsum(vals)

        if log_scale:
            ax.set_yscale('log')
        ax.plot(np.arange(ncomp), vals[:ncomp], 'bo')
        ax.autoscale(tight=True)
        xlim = np.array(ax.get_xlim())
        sp = xlim[1] - xlim[0]
        xlim += 0.02 * np.array([-sp, sp])
        ax.set_xlim(xlim)

        ylim = np.array(ax.get_ylim())
        scale = 0.02
        if log_scale:
            sp = np.log(ylim[1] / ylim[0])
            ylim = np.exp(np.array([np.log(ylim[0]) - scale * sp,
                                    np.log(ylim[1]) + scale * sp]))
        else:
            sp = ylim[1] - ylim[0]
            ylim += scale * np.array([-sp, sp])
        ax.set_ylim(ylim)
        ax.set_title('Scree Plot')
        ax.set_ylabel('Eigenvalue')
        ax.set_xlabel('Component Number')
        fig.tight_layout()

        return fig

    def plot_rsquare(self, ncomp=None, ax=None):
        """
        Box plots of the individual series R-square against the number of PCs.

        Parameters
        ----------
        ncomp : int, optional
            Number of components to include in the plot. If None, will
            plot the minimum of 10 or the number of computed components.
        ax : AxesSubplot, optional
            An axes on which to draw the graph. If omitted, a new figure
            is created.

        Returns
        -------
        matplotlib.figure.Figure
            The handle to the figure.

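        Examples
        --------
        A minimal usage sketch (assumes matplotlib is installed)

        >>> import numpy as np
        >>> from statsmodels.multivariate.pca import PCA
        >>> x = np.random.randn(100, 10)
        >>> fig = PCA(x).plot_rsquare(ncomp=5)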
        """

        import statsmodels.graphics.utils as gutils

        fig, ax = gutils.create_mpl_ax(ax)

        ncomp = 10 if ncomp is None else ncomp
        ncomp = min(ncomp, self._ncomp)
        # R2s in rows, series in columns
        r2s = 1.0 - self._ess_indiv / self._tss_indiv
        r2s = r2s[1:]
        r2s = r2s[:ncomp]
        ax.boxplot(r2s.T)
        ax.set_title('Individual Input $R^2$')
        ax.set_ylabel('$R^2$')
        ax.set_xlabel('Number of Included Principal Components')

        return fig


def pca(data, ncomp=None, standardize=True, demean=True, normalize=True,
        gls=False, weights=None, method='svd'):
    """
    Perform Principal Component Analysis (PCA).

    Parameters
    ----------
    data : ndarray
        Variables in columns, observations in rows.
    ncomp : int, optional
        Number of components to return. If None, returns as many as the
        smaller of the number of rows or columns of data.
    standardize : bool, optional
        Flag indicating to use standardized data with mean 0 and unit
        variance. standardize being True implies demean.
    demean : bool, optional
        Flag indicating whether to demean data before computing principal
        components. demean is ignored if standardize is True.
    normalize : bool, optional
        Indicates whether to normalize the factors to have unit inner
        product. If False, the loadings will have unit inner product.
    gls : bool, optional
        Flag indicating to implement a two-step GLS estimator where
        in the first step principal components are used to estimate residuals,
        and then the inverse residual variance is used as a set of weights to
        estimate the final principal components.
    weights : ndarray, optional
        Series weights to use after transforming data according to standardize
        or demean when computing the principal components.
    method : str, optional
        Determines the linear algebra routine used. 'svd', the default,
        uses a singular value decomposition. 'eig' uses an eigenvalue
        decomposition. 'nipals' uses the NIPALS algorithm.

    Returns
    -------
    factors : {ndarray, DataFrame}
        Array (nobs, ncomp) of principal components (also known as scores).
    loadings : {ndarray, DataFrame}
        Array (ncomp, nvar) of principal component loadings for constructing
        the factors.
    projection : {ndarray, DataFrame}
        Array (nobs, nvar) containing the projection of the data onto the
        ncomp estimated factors.
    rsquare : {ndarray, Series}
        Array (ncomp,) where the element in the ith position is the R-square
        of including the first i principal components. The values are
        calculated on the transformed data, not the original data.
    ic : {ndarray, DataFrame}
        Array (ncomp, 3) containing the Bai and Ng (2002) Information
        criteria. Each column is a different criterion, and each row
        represents the number of included factors.
    eigenvals : {ndarray, Series}
        Array of eigenvalues (nvar,).
    eigenvecs : {ndarray, DataFrame}
        Array of eigenvectors (nvar, nvar).

    Notes
    -----
    This is a simple function wrapper around the PCA class. See PCA for
    more information and additional methods.

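    Examples
    --------
    A minimal sketch on simulated data

    >>> import numpy as np
    >>> from statsmodels.multivariate.pca import pca
    >>> x = np.random.randn(100, 10)
    >>> factors, loadings, proj, rsquare, ic, evals, evecs = pca(x, ncomp=3)
    >>> factors.shape
    (100, 3)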
    """

    pc = PCA(data, ncomp=ncomp, standardize=standardize, demean=demean,
             normalize=normalize, gls=gls, weights=weights, method=method)

    return (pc.factors, pc.loadings, pc.projection, pc.rsquare, pc.ic,
            pc.eigenvals, pc.eigenvecs)