
# -*- coding: utf-8 -*-
"""Lowess - wrapper for cythonized extension

Author : Chris Jordan-Squire
Author : Carl Vogel
Author : Josef Perktold

"""

import numpy as np
from ._smoothers_lowess import lowess as _lowess


def lowess(endog, exog, frac=2.0/3.0, it=3, delta=0.0, is_sorted=False,
           missing='drop', return_sorted=True):
    '''LOWESS (Locally Weighted Scatterplot Smoothing)

    A lowess function that outputs smoothed estimates of endog
    at the given exog values from points (exog, endog).

    Parameters
    ----------
    endog : 1-D numpy array
        The y-values of the observed points.
    exog : 1-D numpy array
        The x-values of the observed points.
    frac : float
        Between 0 and 1. The fraction of the data used
        when estimating each y-value.
    it : int
        The number of residual-based reweightings
        to perform.
    delta : float
        Distance within which to use linear interpolation
        instead of weighted regression.
    is_sorted : bool
        If False (default), then the data will be sorted by exog before
        calculating lowess. If True, then it is assumed that the data is
        already sorted by exog.
    missing : str
        Available options are 'none', 'drop', and 'raise'. If 'none', no nan
        checking is done. If 'drop', any observations with nans are dropped.
        If 'raise', an error is raised. Default is 'drop'.
    return_sorted : bool
        If True (default), then the returned array is sorted by exog and has
        missing (nan or infinite) observations removed.
        If False, then the returned array has the same length and the same
        order of observations as the input arrays.

    Returns
    -------
    out : ndarray, float
        The returned array is two-dimensional if return_sorted is True, and
        one-dimensional if return_sorted is False.
        If return_sorted is True, then a numpy array with two columns. The
        first column contains the sorted x (exog) values and the second column
        the associated estimated y (endog) values.
        If return_sorted is False, then only the fitted values are returned,
        and the observations will be in the same order as the input arrays.

    Notes
    -----
    This lowess function implements the algorithm given in the
    reference below using local linear estimates.

    Suppose the input data has N points. The algorithm works by
    estimating the `smooth` y_i by taking the frac*N closest points
    to (x_i, y_i) based on their x values and estimating y_i
    using a weighted linear regression. The weight for (x_j, y_j)
    is the tricube function applied to abs(x_i - x_j).

    If it > 1, then further weighted local linear regressions
    are performed, where the weights are the same as above
    times the _lowess_bisquare function of the residuals. Each iteration
    takes approximately the same amount of time as the original fit,
    so these iterations are expensive. They are most useful when
    the noise has extremely heavy tails, such as Cauchy noise.
    Noise with less heavy tails, such as t-distributions with df > 2,
    is less problematic. The weights downgrade the influence of
    points with large residuals. In the extreme case, points whose
    residuals are larger than 6 times the median absolute residual
    are given weight 0.

    `delta` can be used to save computations. For each `x_i`, regressions
    are skipped for points closer than `delta`. The next regression is
    fit for the farthest point within delta of `x_i` and all points in
    between are estimated by linearly interpolating between the two
    regression fits.

    Judicious choice of delta can cut computation time considerably
    for large data (N > 5000). A good choice is
    ``delta = 0.01 * (exog.max() - exog.min())``.

    Some experimentation is likely required to find a good
    choice of `frac` and `it` for a particular dataset.
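
    The single weighted regression described above can be sketched in
    plain NumPy. This is only a simplified illustration of the scheme in
    the Notes, not the cythonized implementation; the helper name
    ``lowess_point`` is made up for this example.

    ```python
    import numpy as np

    def lowess_point(x, y, x0, frac=2.0/3.0):
        """Illustrative sketch: one tricube-weighted local linear
        regression estimating the smoothed value at x0."""
        n = len(x)
        k = max(2, int(np.ceil(frac * n)))
        dist = np.abs(x - x0)
        nearest = np.argsort(dist)[:k]      # frac*N closest points by x
        d = dist[nearest]
        w = (1 - (d / d.max()) ** 3) ** 3   # tricube weights
        xw, yw = x[nearest], y[nearest]
        xbar = (w * xw).sum() / w.sum()
        ybar = (w * yw).sum() / w.sum()
        # Weighted least-squares slope and intercept for y = a + b*x
        b = ((w * (xw - xbar) * (yw - ybar)).sum()
             / (w * (xw - xbar) ** 2).sum())
        a = ybar - b * xbar
        return a + b * x0
    ```

    On exactly linear data the weighted fit recovers the line, so
    ``lowess_point`` returns the true value regardless of the weights.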

    References
    ----------
    Cleveland, W.S. (1979) "Robust Locally Weighted Regression
    and Smoothing Scatterplots". Journal of the American Statistical
    Association 74 (368): 829-836.

    Examples
    --------
    The example below compares how different the lowess fits can be
    for different values of frac.

    >>> import numpy as np
    >>> import statsmodels.api as sm
    >>> lowess = sm.nonparametric.lowess
    >>> x = np.random.uniform(low=-2*np.pi, high=2*np.pi, size=500)
    >>> y = np.sin(x) + np.random.normal(size=len(x))
    >>> z = lowess(y, x)
    >>> w = lowess(y, x, frac=1./3)

    This gives a similar comparison between it=0 (no robustifying
    iterations) and the default it=3.

    >>> import numpy as np
    >>> import scipy.stats as stats
    >>> import statsmodels.api as sm
    >>> lowess = sm.nonparametric.lowess
    >>> x = np.random.uniform(low=-2*np.pi, high=2*np.pi, size=500)
    >>> y = np.sin(x) + stats.cauchy.rvs(size=len(x))
    >>> z = lowess(y, x, frac=1./3, it=0)
    >>> w = lowess(y, x, frac=1./3)

    '''


    endog = np.asarray(endog, float)
    exog = np.asarray(exog, float)

    # Inputs should be vectors (1-D arrays) of the
    # same length.
    if exog.ndim != 1:
        raise ValueError('exog must be a vector')
    if endog.ndim != 1:
        raise ValueError('endog must be a vector')
    if endog.shape[0] != exog.shape[0]:
        raise ValueError('exog and endog must have same length')

    if missing in ['drop', 'raise']:
        # Cut out missing values
        mask_valid = (np.isfinite(exog) & np.isfinite(endog))
        all_valid = np.all(mask_valid)
        if all_valid:
            y = endog
            x = exog
        else:
            if missing == 'drop':
                x = exog[mask_valid]
                y = endog[mask_valid]
            else:
                raise ValueError('nan or inf found in data')
    elif missing == 'none':
        y = endog
        x = exog
        all_valid = True   # we assume it's true if missing='none'
    else:
        raise ValueError("missing can only be 'none', 'drop' or 'raise'")

    if not is_sorted:
        # Sort both inputs according to the ascending order of x values
        sort_index = np.argsort(x)
        x = np.array(x[sort_index])
        y = np.array(y[sort_index])

    res = _lowess(y, x, frac=frac, it=it, delta=delta)
    _, yfitted = res.T

    if return_sorted:
        return res
    else:
        # rebuild yfitted with original indices
        # a bit messy: y might have been selected twice
        if not is_sorted:
            yfitted_ = np.empty_like(y)
            yfitted_.fill(np.nan)
            yfitted_[sort_index] = yfitted
            yfitted = yfitted_

        if not all_valid:
            yfitted_ = np.empty_like(endog)
            yfitted_.fill(np.nan)
            yfitted_[mask_valid] = yfitted
            yfitted = yfitted_

        # we do not need to return exog anymore
        return yfitted
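

# The `delta` shortcut described in the docstring Notes skips regressions
# for x values within `delta` of the last fitted point and linearly
# interpolates in between. The sketch below illustrates the anchor-selection
# idea in plain NumPy; the helper name `thin_by_delta` is hypothetical and
# not part of statsmodels or of the cythonized extension.

# ```python
# import numpy as np
#
# def thin_by_delta(x, delta):
#     """Indices at which regressions would actually be fit when `x`
#     is sorted: skip any point within `delta` of the last anchor,
#     always keeping the first and last points."""
#     anchors = [0]
#     for i in range(1, len(x)):
#         if x[i] - x[anchors[-1]] > delta:
#             anchors.append(i)
#     if anchors[-1] != len(x) - 1:
#         anchors.append(len(x) - 1)
#     return np.array(anchors)
#
# # Fitted values between anchors are then linear interpolations, e.g.:
# # yhat_full = np.interp(x, x[anchors], yhat_at_anchors)
# ```

# With delta = 0, every point gets its own regression; a larger delta
# trades a small amount of accuracy for far fewer regressions on dense,
# large-N data.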