Mercurial > repos > guerler > springsuite
planemo/lib/python3.7/site-packages/boltons/statsutils.py @ 0:d30785e31577 draft
"planemo upload commit 6eee67778febed82ddd413c3ca40b3183a3898f1"
author: guerler
date: Fri, 31 Jul 2020 00:18:57 -0400
# -*- coding: utf-8 -*-
"""``statsutils`` provides tools aimed primarily at descriptive
statistics for data analysis, such as :func:`mean` (average),
:func:`median`, :func:`variance`, and many others.

The :class:`Stats` type provides all the main functionality of the
``statsutils`` module. A :class:`Stats` object wraps a given dataset,
providing all statistical measures as property attributes. These
attributes cache their results, which allows efficient computation of
multiple measures, as many measures rely on other measures. For
example, relative standard deviation (:attr:`Stats.rel_std_dev`)
relies on both the mean and standard deviation. The Stats object
caches those results so no rework is done.

The :class:`Stats` type's attributes have module-level counterparts for
convenience when the computation reuse advantages do not apply.

>>> stats = Stats(range(42))
>>> stats.mean
20.5
>>> mean(range(42))
20.5

Statistics is a large field, and ``statsutils`` is focused on a few
basic techniques that are useful in software. The following is a brief
introduction to those techniques. For a more in-depth introduction,
see `Statistics for Software
<https://www.paypal-engineering.com/2016/04/11/statistics-for-software/>`_,
an article I wrote on the topic. It introduces key terminology vital
to effective usage of statistics.

Statistical moments
-------------------

Python programmers are probably familiar with the concept of the
*mean* or *average*, which gives a rough quantitative middle value by
which a sample can be generalized. However, the mean is just
the first of four `moment`_-based measures by which a sample or
distribution can be measured.

The four `Standardized moments`_ are:

1. `Mean`_ - :func:`mean` - theoretical middle value
2. `Variance`_ - :func:`variance` - width of value dispersion
3. `Skewness`_ - :func:`skewness` - symmetry of distribution
4. `Kurtosis`_ - :func:`kurtosis` - "peakiness" or "long-tailed"-ness

For more information check out `the Moment article on Wikipedia`_.

.. _moment: https://en.wikipedia.org/wiki/Moment_(mathematics)
.. _Standardized moments: https://en.wikipedia.org/wiki/Standardized_moment
.. _Mean: https://en.wikipedia.org/wiki/Mean
.. _Variance: https://en.wikipedia.org/wiki/Variance
.. _Skewness: https://en.wikipedia.org/wiki/Skewness
.. _Kurtosis: https://en.wikipedia.org/wiki/Kurtosis
.. _the Moment article on Wikipedia: https://en.wikipedia.org/wiki/Moment_(mathematics)

Keep in mind that while these moments can give a bit more insight into
the shape and distribution of data, they do not guarantee a complete
picture. Wildly different datasets can have the same values for all
four moments, so generalize wisely.
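As a rough illustration (not part of the module itself), the four measures can be computed with plain Python; the normalizations below mirror the formulas used later in this file (population variance, and skewness/kurtosis normalized by ``(n - 1) * std_dev ** k``):

```python
# Sketch: the four moment-based measures for a small sample.
data = list(range(97))

n = len(data)
mean = sum(data) / float(n)
variance = sum((v - mean) ** 2 for v in data) / n  # 2nd moment
std_dev = variance ** 0.5
skewness = sum((v - mean) ** 3 for v in data) / ((n - 1) * std_dev ** 3)
kurtosis = sum((v - mean) ** 4 for v in data) / ((n - 1) * std_dev ** 4)

print(mean, variance, skewness)  # 48.0 784.0 0.0
print(kurtosis)                  # ~1.82, well below the normal curve's 3
```

Uniform data like this is symmetric (zero skewness) and flat (kurtosis well under 3), which is exactly the kind of shape information the mean alone cannot convey.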

Robust statistics
-----------------

Moment-based statistics are notorious for being easily skewed by
outliers. The whole field of robust statistics aims to mitigate this
dilemma. ``statsutils`` also includes several robust statistical methods:

* `Median`_ - The middle value of a sorted dataset
* `Trimean`_ - Another robust measure of the data's central tendency
* `Median Absolute Deviation`_ (MAD) - A robust measure of
  variability, a natural counterpart to :func:`variance`.
* `Trimming`_ - Reducing a dataset to only the middle majority of
  data is a simple way of making other estimators more robust.

.. _Median: https://en.wikipedia.org/wiki/Median
.. _Trimean: https://en.wikipedia.org/wiki/Trimean
.. _Median Absolute Deviation: https://en.wikipedia.org/wiki/Median_absolute_deviation
.. _Trimming: https://en.wikipedia.org/wiki/Trimmed_estimator

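A quick sketch (with an arbitrary outlier, not from the module) of why the median is called robust: one extreme value drags the mean far from the bulk of the data but barely moves the median.

```python
# Sketch: a single outlier distorts the mean but not the median.
data = list(range(20)) + [10000]  # 10000 is an arbitrary outlier

mean = sum(data) / float(len(data))
median = sorted(data)[len(data) // 2]  # odd-length sample: middle value

print(mean)    # ~485.2, nowhere near the bulk of the data
print(median)  # 10
```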


Online and Offline Statistics
-----------------------------

Unrelated to computer networking, `online`_ statistics involve
calculating statistics in a `streaming`_ fashion, without all the data
being available. The :class:`Stats` type is meant for the more
traditional offline statistics when all the data is available. For
pure-Python online statistics accumulators, look at the `Lithoxyl`_
system instrumentation package.

.. _online: https://en.wikipedia.org/wiki/Online_algorithm
.. _streaming: https://en.wikipedia.org/wiki/Streaming_algorithm
.. _Lithoxyl: https://github.com/mahmoud/lithoxyl

"""
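For contrast with the offline ``Stats`` type, here is a minimal sketch (not from this module or Lithoxyl) of what an online accumulator looks like, using Welford's algorithm for a running mean and variance:

```python
# Sketch: a minimal online (streaming) accumulator using Welford's
# algorithm. Values are consumed one at a time; the full dataset is
# never held in memory.
class OnlineStats(object):
    def __init__(self):
        self.count, self.mean, self._m2 = 0, 0.0, 0.0

    def add(self, value):
        # update the running mean and the sum of squared deviations
        self.count += 1
        delta = value - self.mean
        self.mean += delta / self.count
        self._m2 += delta * (value - self.mean)

    @property
    def variance(self):  # population variance, matching Stats.variance
        return self._m2 / self.count if self.count else 0.0

acc = OnlineStats()
for v in range(97):
    acc.add(v)
print(acc.mean, acc.variance)  # 48.0 784.0 (up to float rounding)
```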

from __future__ import print_function

import bisect
from math import floor, ceil


class _StatsProperty(object):
    def __init__(self, name, func):
        self.name = name
        self.func = func
        self.internal_name = '_' + name

        doc = func.__doc__ or ''
        pre_doctest_doc, _, _ = doc.partition('>>>')
        self.__doc__ = pre_doctest_doc

    def __get__(self, obj, objtype=None):
        if obj is None:
            return self
        if not obj.data:
            return obj.default
        try:
            return getattr(obj, self.internal_name)
        except AttributeError:
            setattr(obj, self.internal_name, self.func(obj))
            return getattr(obj, self.internal_name)


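The descriptor above computes a measure on first access and caches it under a leading-underscore attribute on the instance. A standalone sketch of the same compute-once-then-cache pattern (illustrative names, not part of the module):

```python
# Sketch: the caching-descriptor pattern used by _StatsProperty,
# reduced to its essentials.
class cached_property(object):
    def __init__(self, func):
        self.func = func
        self.internal_name = '_' + func.__name__

    def __get__(self, obj, objtype=None):
        if obj is None:
            return self
        try:
            return getattr(obj, self.internal_name)  # cache hit
        except AttributeError:
            # cache miss: compute once, store on the instance
            setattr(obj, self.internal_name, self.func(obj))
            return getattr(obj, self.internal_name)

class Example(object):
    calls = 0

    @cached_property
    def answer(self):
        Example.calls += 1  # count how many times the body runs
        return 42

ex = Example()
print(ex.answer, ex.answer, Example.calls)  # 42 42 1
```

The second access never runs the function body, which is why chained measures like ``rel_std_dev`` (mean plus standard deviation) cost only one pass each over the data.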
class Stats(object):
    """The ``Stats`` type is used to represent a group of unordered
    statistical datapoints for calculations such as mean, median, and
    variance.

    Args:

        data (list): List or other iterable containing numeric values.
        default (float): A value to be returned when a given
            statistical measure is not defined. 0.0 by default, but
            ``float('nan')`` is appropriate for stricter applications.
        use_copy (bool): By default Stats objects copy the initial
            data into a new list to avoid issues with
            modifications. Pass ``False`` to disable this behavior.
        is_sorted (bool): Presorted data can skip an extra sorting
            step for a little speed boost. Defaults to False.

    """
    def __init__(self, data, default=0.0, use_copy=True, is_sorted=False):
        self._use_copy = use_copy
        self._is_sorted = is_sorted
        if use_copy:
            self.data = list(data)
        else:
            self.data = data

        self.default = default
        cls = self.__class__
        self._prop_attr_names = [a for a in dir(self)
                                 if isinstance(getattr(cls, a, None),
                                               _StatsProperty)]
        self._pearson_precision = 0

    def __len__(self):
        return len(self.data)

    def __iter__(self):
        return iter(self.data)

    def _get_sorted_data(self):
        """When using a copy of the data, it's better to have that copy be
        sorted, but we do it lazily using this method, in case no
        sorted measures are used. I.e., if median is never called,
        sorting would be a waste.

        When not using a copy, it's presumed that all optimizations
        are on the user.
        """
        if not self._use_copy:
            return sorted(self.data)
        elif not self._is_sorted:
            self.data.sort()
        return self.data

    def clear_cache(self):
        """``Stats`` objects automatically cache intermediary calculations
        that can be reused. For instance, accessing the ``std_dev``
        attribute after the ``variance`` attribute will be
        significantly faster for medium-to-large datasets.

        If you modify the object by adding additional data points,
        call this function to have the cached statistics recomputed.

        """
        for attr_name in self._prop_attr_names:
            attr_name = getattr(self.__class__, attr_name).internal_name
            if not hasattr(self, attr_name):
                continue
            delattr(self, attr_name)
        return

    def _calc_count(self):
        """The number of items in this Stats object. Returns the same as
        :func:`len` on a Stats object, but provided for pandas terminology
        parallelism.

        >>> Stats(range(20)).count
        20
        """
        return len(self.data)
    count = _StatsProperty('count', _calc_count)

    def _calc_mean(self):
        """
        The arithmetic mean, or "average". Sum of the values divided by
        the number of values.

        >>> mean(range(20))
        9.5
        >>> mean(list(range(19)) + [949])  # 949 is an arbitrary outlier
        56.0
        """
        return sum(self.data, 0.0) / len(self.data)
    mean = _StatsProperty('mean', _calc_mean)

    def _calc_max(self):
        """
        The maximum value present in the data.

        >>> Stats([2, 1, 3]).max
        3
        """
        if self._is_sorted:
            return self.data[-1]
        return max(self.data)
    max = _StatsProperty('max', _calc_max)

    def _calc_min(self):
        """
        The minimum value present in the data.

        >>> Stats([2, 1, 3]).min
        1
        """
        if self._is_sorted:
            return self.data[0]
        return min(self.data)
    min = _StatsProperty('min', _calc_min)

    def _calc_median(self):
        """
        The median is either the middle value or the average of the two
        middle values of a sample. Compared to the mean, it's generally
        more resilient to the presence of outliers in the sample.

        >>> median([2, 1, 3])
        2
        >>> median(range(97))
        48
        >>> median(list(range(96)) + [1066])  # 1066 is an arbitrary outlier
        48
        """
        return self._get_quantile(self._get_sorted_data(), 0.5)
    median = _StatsProperty('median', _calc_median)

    def _calc_iqr(self):
        """Inter-quartile range (IQR) is the difference between the 75th
        percentile and 25th percentile. IQR is a robust measure of
        dispersion, like standard deviation, but safer to compare
        between datasets, as it is less influenced by outliers.

        >>> iqr([1, 2, 3, 4, 5])
        2
        >>> iqr(range(1001))
        500
        """
        return self.get_quantile(0.75) - self.get_quantile(0.25)
    iqr = _StatsProperty('iqr', _calc_iqr)

    def _calc_trimean(self):
        """The trimean is a robust measure of central tendency, like the
        median, that takes the weighted average of the median and the
        upper and lower quartiles.

        >>> trimean([2, 1, 3])
        2.0
        >>> trimean(range(97))
        48.0
        >>> trimean(list(range(96)) + [1066])  # 1066 is an arbitrary outlier
        48.0

        """
        sorted_data = self._get_sorted_data()
        gq = lambda q: self._get_quantile(sorted_data, q)
        return (gq(0.25) + (2 * gq(0.5)) + gq(0.75)) / 4.0
    trimean = _StatsProperty('trimean', _calc_trimean)

    def _calc_variance(self):
        """\
        Variance is the average of the squares of the difference between
        each value and the mean.

        >>> variance(range(97))
        784.0
        """
        global mean  # defined elsewhere in this file
        return mean(self._get_pow_diffs(2))
    variance = _StatsProperty('variance', _calc_variance)

    def _calc_std_dev(self):
        """\
        Standard deviation. Square root of the variance.

        >>> std_dev(range(97))
        28.0
        """
        return self.variance ** 0.5
    std_dev = _StatsProperty('std_dev', _calc_std_dev)

    def _calc_median_abs_dev(self):
        """\
        Median Absolute Deviation is a robust measure of statistical
        dispersion: http://en.wikipedia.org/wiki/Median_absolute_deviation

        >>> median_abs_dev(range(97))
        24.0
        """
        global median  # defined elsewhere in this file
        sorted_vals = sorted(self.data)
        x = float(median(sorted_vals))
        return median([abs(x - v) for v in sorted_vals])
    median_abs_dev = _StatsProperty('median_abs_dev', _calc_median_abs_dev)
    mad = median_abs_dev  # convenience

    def _calc_rel_std_dev(self):
        """\
        Standard deviation divided by the absolute value of the average.

        http://en.wikipedia.org/wiki/Relative_standard_deviation

        >>> print('%1.3f' % rel_std_dev(range(97)))
        0.583
        """
        abs_mean = abs(self.mean)
        if abs_mean:
            return self.std_dev / abs_mean
        else:
            return self.default
    rel_std_dev = _StatsProperty('rel_std_dev', _calc_rel_std_dev)

    def _calc_skewness(self):
        """\
        Indicates the asymmetry of a curve. Positive values mean the bulk
        of the values are on the left side of the average and vice versa.

        http://en.wikipedia.org/wiki/Skewness

        See the module docstring for more about statistical moments.

        >>> skewness(range(97))  # symmetrical around 48.0
        0.0
        >>> left_skewed = skewness(list(range(97)) + list(range(10)))
        >>> right_skewed = skewness(list(range(97)) + list(range(87, 97)))
        >>> round(left_skewed, 3), round(right_skewed, 3)
        (0.114, -0.114)
        """
        data, s_dev = self.data, self.std_dev
        if len(data) > 1 and s_dev > 0:
            return (sum(self._get_pow_diffs(3)) /
                    float((len(data) - 1) * (s_dev ** 3)))
        else:
            return self.default
    skewness = _StatsProperty('skewness', _calc_skewness)

    def _calc_kurtosis(self):
        """\
        Indicates how much data is in the tails of the distribution. The
        result is always positive, with the normal "bell-curve"
        distribution having a kurtosis of 3.

        http://en.wikipedia.org/wiki/Kurtosis

        See the module docstring for more about statistical moments.

        >>> kurtosis(range(9))
        1.99125

        With a kurtosis of 1.99125, [0, 1, 2, 3, 4, 5, 6, 7, 8] is more
        centrally distributed than the normal curve.
        """
        data, s_dev = self.data, self.std_dev
        if len(data) > 1 and s_dev > 0:
            return (sum(self._get_pow_diffs(4)) /
                    float((len(data) - 1) * (s_dev ** 4)))
        else:
            return 0.0
    kurtosis = _StatsProperty('kurtosis', _calc_kurtosis)

    def _calc_pearson_type(self):
        precision = self._pearson_precision
        skewness = self.skewness
        kurtosis = self.kurtosis
        beta1 = skewness ** 2.0
        beta2 = kurtosis * 1.0

        # TODO: range checks?

        c0 = (4 * beta2) - (3 * beta1)
        c1 = skewness * (beta2 + 3)
        c2 = (2 * beta2) - (3 * beta1) - 6

        if round(c1, precision) == 0:
            if round(beta2, precision) == 3:
                return 0  # Normal
            else:
                if beta2 < 3:
                    return 2  # Symmetric Beta
                elif beta2 > 3:
                    return 7
        elif round(c2, precision) == 0:
            return 3  # Gamma
        else:
            k = c1 ** 2 / (4 * c0 * c2)
            if k < 0:
                return 1  # Beta
        raise RuntimeError('missed a spot')
    pearson_type = _StatsProperty('pearson_type', _calc_pearson_type)
    @staticmethod
    def _get_quantile(sorted_data, q):
        data, n = sorted_data, len(sorted_data)
        idx = q / 1.0 * (n - 1)
        idx_f, idx_c = int(floor(idx)), int(ceil(idx))
        if idx_f == idx_c:
            return data[idx_f]
        return (data[idx_f] * (idx_c - idx)) + (data[idx_c] * (idx - idx_f))

    def get_quantile(self, q):
        """Get a quantile from the dataset. Quantiles are floating point
        values between ``0.0`` and ``1.0``, with ``0.0`` representing
        the minimum value in the dataset and ``1.0`` representing the
        maximum. ``0.5`` represents the median:

        >>> Stats(range(100)).get_quantile(0.5)
        49.5
        """
        q = float(q)
        if not 0.0 <= q <= 1.0:
            raise ValueError('expected q between 0.0 and 1.0, not %r' % q)
        elif not self.data:
            return self.default
        return self._get_quantile(self._get_sorted_data(), q)
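The quantile computation maps ``q`` onto a fractional index into the sorted data; when that index falls between two elements, the result is their distance-weighted average. A standalone sketch of the same interpolation:

```python
from math import floor, ceil

# Sketch: linear-interpolation quantile, as in Stats._get_quantile.
def quantile(sorted_data, q):
    n = len(sorted_data)
    idx = q * (n - 1)  # fractional index for quantile q
    idx_f, idx_c = int(floor(idx)), int(ceil(idx))
    if idx_f == idx_c:  # exact hit on an element
        return sorted_data[idx_f]
    # between two elements: weight each by its distance to idx
    return (sorted_data[idx_f] * (idx_c - idx)
            + sorted_data[idx_c] * (idx - idx_f))

print(quantile(list(range(100)), 0.5))  # 49.5, averaging 49 and 50
print(quantile([1, 2, 3, 4, 5], 0.25))  # 2, an exact hit at index 1
```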

    def get_zscore(self, value):
        """Get the z-score for *value* in the group. If the standard
        deviation is 0, then 0, ``inf``, or ``-inf`` will be returned,
        indicating whether the value is equal to, greater than, or
        below the group's mean, respectively.
        """
        mean = self.mean
        if self.std_dev == 0:
            if value == mean:
                return 0
            if value > mean:
                return float('inf')
            if value < mean:
                return float('-inf')
        return (float(value) - mean) / self.std_dev
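A z-score is just the value's distance from the mean, measured in standard deviations. A sketch with plain Python (population standard deviation, matching ``Stats.std_dev``):

```python
# Sketch: z-score = (value - mean) / std_dev.
data = list(range(97))  # mean 48.0, std_dev 28.0

n = len(data)
mean = sum(data) / float(n)
std_dev = (sum((v - mean) ** 2 for v in data) / n) ** 0.5

zscore = lambda value: (float(value) - mean) / std_dev
print(zscore(48))   # 0.0 -- exactly at the mean
print(zscore(104))  # 2.0 -- two standard deviations above it
```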

    def trim_relative(self, amount=0.15):
        """A utility function used to cut a proportion of values off each end
        of a list of values. This has the effect of limiting the
        effect of outliers.

        Args:
            amount (float): A value between 0.0 and 0.5 to trim off of
                each side of the data.

        .. note::

            This operation modifies the data in-place. It does not
            make or return a copy.

        """
        trim = float(amount)
        if not 0.0 <= trim < 0.5:
            raise ValueError('expected amount between 0.0 and 0.5, not %r'
                             % trim)
        size = len(self.data)
        size_diff = int(size * trim)
        if size_diff == 0:
            return
        self.data = self._get_sorted_data()[size_diff:-size_diff]
        self.clear_cache()

    def _get_pow_diffs(self, power):
        """
        A utility function used for calculating statistical moments.
        """
        m = self.mean
        return [(v - m) ** power for v in self.data]

    def _get_bin_bounds(self, count=None, with_max=False):
        if not self.data:
            return [0.0]  # TODO: raise?

        data = self.data
        len_data, min_data, max_data = len(data), min(data), max(data)

        if len_data < 4:
            if not count:
                count = len_data
            dx = (max_data - min_data) / float(count)
            bins = [min_data + (dx * i) for i in range(count)]
        elif count is None:
            # freedman algorithm for fixed-width bin selection
            q25, q75 = self.get_quantile(0.25), self.get_quantile(0.75)
            dx = 2 * (q75 - q25) / (len_data ** (1 / 3.0))
            bin_count = max(1, int(ceil((max_data - min_data) / dx)))
            bins = [min_data + (dx * i) for i in range(bin_count + 1)]
            bins = [b for b in bins if b < max_data]
        else:
            dx = (max_data - min_data) / float(count)
            bins = [min_data + (dx * i) for i in range(count)]

        if with_max:
            bins.append(float(max_data))

        return bins
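The ``count is None`` branch above applies the Freedman-Diaconis rule: bin width ``2 * IQR / n ** (1/3)``, which adapts to both the spread and the size of the sample. A self-contained sketch of that rule (illustrative helper names, not the module's API):

```python
from math import ceil, floor

# Sketch: Freedman-Diaconis fixed-width bin selection, as used by
# Stats._get_bin_bounds when no bin count is supplied.
def freedman_bin_bounds(sorted_data):
    n = len(sorted_data)

    def quantile(q):  # linear-interpolation quantile, as elsewhere here
        idx = q * (n - 1)
        f, c = int(floor(idx)), int(ceil(idx))
        if f == c:
            return sorted_data[f]
        return sorted_data[f] * (c - idx) + sorted_data[c] * (idx - f)

    iqr = quantile(0.75) - quantile(0.25)
    dx = 2 * iqr / (n ** (1 / 3.0))  # the Freedman-Diaconis bin width
    lo, hi = sorted_data[0], sorted_data[-1]
    bin_count = max(1, int(ceil((hi - lo) / dx)))
    return [lo + dx * i for i in range(bin_count + 1)]

bounds = freedman_bin_bounds(list(range(1001)))
print(len(bounds) - 1)  # 11 bins for this sample
```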

    def get_histogram_counts(self, bins=None, **kw):
        """Produces a list of ``(bin, count)`` pairs comprising a histogram of
        the Stats object's data, using fixed-width bins. See
        :meth:`Stats.format_histogram` for more details.

        Args:
            bins (int): maximum number of bins, or list of
                floating-point bin boundaries. Defaults to the output of
                Freedman's algorithm.
            bin_digits (int): Number of digits used to round down the
                bin boundaries. Defaults to 1.

        The output of this method can be stored and/or modified, and
        then passed to :func:`statsutils.format_histogram_counts` to
        achieve the same text formatting as the
        :meth:`~Stats.format_histogram` method. This can be useful for
        snapshotting over time.
        """
        bin_digits = int(kw.pop('bin_digits', 1))
        if kw:
            raise TypeError('unexpected keyword arguments: %r' % kw.keys())

        if not bins:
            bins = self._get_bin_bounds()
        else:
            try:
                bin_count = int(bins)
            except TypeError:
                try:
                    bins = [float(x) for x in bins]
                except Exception:
                    raise ValueError('bins expected integer bin count or list'
                                     ' of float bin boundaries, not %r' % bins)
                if self.min < bins[0]:
                    bins = [self.min] + bins
            else:
                bins = self._get_bin_bounds(bin_count)

        # floor and ceil really should have taken ndigits, like round()
        round_factor = 10.0 ** bin_digits
        bins = [floor(b * round_factor) / round_factor for b in bins]
        bins = sorted(set(bins))

        idxs = [bisect.bisect(bins, d) - 1 for d in self.data]
        count_map = {}  # would have used Counter, but py26 support
        for idx in idxs:
            try:
                count_map[idx] += 1
            except KeyError:
                count_map[idx] = 1

        bin_counts = [(b, count_map.get(i, 0)) for i, b in enumerate(bins)]

        return bin_counts

    def format_histogram(self, bins=None, **kw):
        """Produces a textual histogram of the data, using fixed-width bins,
        allowing for simple visualization, even in console environments.

        >>> data = list(range(20)) + list(range(5, 15)) + [10]
        >>> print(Stats(data).format_histogram(width=30))
         0.0:  5 #########
         4.4:  8 ###############
         8.9: 11 ####################
        13.3:  5 #########
        17.8:  2 ####

        In this histogram, five values are between 0.0 and 4.4, eight
        are between 4.4 and 8.9, and two values lie between 17.8 and
        the max.

        You can specify the number of bins, or provide a list of
        bin boundaries themselves. If no bins are provided, as in the
        example above, `Freedman's algorithm`_ for bin selection is
        used.

        Args:
            bins (int): Maximum number of bins for the
                histogram. Also accepts a list of floating-point
                bin boundaries. If the minimum boundary is still
                greater than the minimum value in the data, that
                boundary will be implicitly added. Defaults to the bin
                boundaries returned by `Freedman's algorithm`_.
            bin_digits (int): Number of digits to round each bin
                to. Note that bins are always rounded down to avoid
                clipping any data. Defaults to 1.
            width (int): integer number of columns in the longest line
                in the histogram. Defaults to console width on Python
                3.3+, or 80 if that is not available.
            format_bin (callable): Called on each bin to create a
                label for the final output. Use this function to add
                units, such as "ms" for milliseconds.

        Should you want something more programmatically reusable, see
        the :meth:`~Stats.get_histogram_counts` method, the output of
        which is used by format_histogram. The :meth:`~Stats.describe`
        method is another useful summarization method, albeit less
        visual.

        .. _Freedman's algorithm: https://en.wikipedia.org/wiki/Freedman%E2%80%93Diaconis_rule
        """
        width = kw.pop('width', None)
        format_bin = kw.pop('format_bin', None)
        bin_counts = self.get_histogram_counts(bins=bins, **kw)
        return format_histogram_counts(bin_counts,
                                       width=width,
                                       format_bin=format_bin)

    def describe(self, quantiles=None, format=None):
        """Provides standard summary statistics for the data in the Stats
        object, in one of several convenient formats.

        Args:
            quantiles (list): A list of numeric values to use as
                quantiles in the resulting summary. All values must be
                0.0-1.0, with 0.5 representing the median. Defaults to
                ``[0.25, 0.5, 0.75]``, representing the standard
                quartiles.
            format (str): Controls the return type of the function,
                with one of three valid values: ``"dict"`` gives back
                a :class:`dict` with the appropriate keys and
                values. ``"list"`` is a list of key-value pairs in an
                order suitable to pass to an OrderedDict or HTML
                table. ``"text"`` converts the values to text suitable
                for printing, as seen below.

        Here is the information returned by a default ``describe``, as
        presented in the ``"text"`` format:

        >>> stats = Stats(range(1, 8))
        >>> print(stats.describe(format='text'))
        count:    7
        mean:     4.0
        std_dev:  2.0
        mad:      2.0
        min:      1
        0.25:     2.5
        0.5:      4
        0.75:     5.5
        max:      7

        For more advanced descriptive statistics, check out my blog
        post on the topic `Statistics for Software
        <https://www.paypal-engineering.com/2016/04/11/statistics-for-software/>`_.

        """
        if format is None:
            format = 'dict'
        elif format not in ('dict', 'list', 'text'):
            raise ValueError('invalid format for describe,'
                             ' expected one of "dict"/"list"/"text", not %r'
                             % format)
        quantiles = quantiles or [0.25, 0.5, 0.75]
        q_items = []
        for q in quantiles:
            q_val = self.get_quantile(q)
            q_items.append((str(q), q_val))

        items = [('count', self.count),
                 ('mean', self.mean),
                 ('std_dev', self.std_dev),
                 ('mad', self.mad),
                 ('min', self.min)]

        items.extend(q_items)
        items.append(('max', self.max))
        if format == 'dict':
            ret = dict(items)
        elif format == 'list':
            ret = items
        elif format == 'text':
            ret = '\n'.join(['%s%s' % ((label + ':').ljust(10), val)
                             for label, val in items])
        return ret


def describe(data, quantiles=None, format=None):
    """A convenience function to get standard summary statistics useful
    for describing most data. See :meth:`Stats.describe` for more
    details.

    >>> print(describe(range(7), format='text'))
    count:    7
    mean:     3.0
    std_dev:  2.0
    mad:      2.0
    min:      0
    0.25:     1.5
    0.5:      3
    0.75:     4.5
    max:      6

    See :meth:`Stats.format_histogram` for another very useful
    summarization that uses textual visualization.
    """
    return Stats(data).describe(quantiles=quantiles, format=format)


def _get_conv_func(attr_name):
    def stats_helper(data, default=0.0):
        return getattr(Stats(data, default=default, use_copy=False),
                       attr_name)
    return stats_helper


for attr_name, attr in list(Stats.__dict__.items()):
    if isinstance(attr, _StatsProperty):
        if attr_name in ('max', 'min', 'count'):  # don't shadow builtins
            continue
        if attr_name in ('mad',):  # convenience aliases
            continue
        func = _get_conv_func(attr_name)
        func.__doc__ = attr.func.__doc__
        globals()[attr_name] = func
        delattr(Stats, '_calc_' + attr_name)
# cleanup
del attr
del attr_name
del func


def format_histogram_counts(bin_counts, width=None, format_bin=None):
    """The formatting logic behind :meth:`Stats.format_histogram`, which
    takes the output of :meth:`Stats.get_histogram_counts`, and passes
    them to this function.

    Args:
        bin_counts (list): A list of ``(bin, count)`` pairs.
        width (int): Number of character columns in the text output,
            defaults to 80 or console width in Python 3.3+.
        format_bin (callable): Used to convert bin values into string
            labels.
    """
    lines = []
    if not format_bin:
        format_bin = lambda v: v
    if not width:
        try:
            import shutil  # python 3 convenience
            width = shutil.get_terminal_size()[0]
        except Exception:
            width = 80

    bins = [b for b, _ in bin_counts]
    count_max = max([count for _, count in bin_counts])
    count_cols = len(str(count_max))

    labels = ['%s' % format_bin(b) for b in bins]
    label_cols = max([len(l) for l in labels])
    tmp_line = '%s: %s #' % ('x' * label_cols, count_max)

    bar_cols = max(width - len(tmp_line), 3)
    line_k = float(bar_cols) / count_max
    tmpl = "{label:>{label_cols}}: {count:>{count_cols}} {bar}"
    for label, (bin_val, count) in zip(labels, bin_counts):
        bar_len = int(round(count * line_k))
        bar = ('#' * bar_len) or '|'
        line = tmpl.format(label=label,
                           label_cols=label_cols,
                           count=count,
                           count_cols=count_cols,
                           bar=bar)
        lines.append(line)

    return '\n'.join(lines)
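The bar widths in the function above scale linearly with each bin's count, with the widest bar sized to fill the columns left over after the labels. A self-contained sketch of that scaling step, using the bin counts from the ``format_histogram`` doctest:

```python
# Sketch: scale histogram bars so the largest count fills the
# available columns, as format_histogram_counts does.
bin_counts = [(0.0, 5), (4.4, 8), (8.9, 11), (13.3, 5), (17.8, 2)]
bar_cols = 20  # columns available for the bar itself

count_max = max(count for _, count in bin_counts)
line_k = float(bar_cols) / count_max  # columns per unit of count

for bin_val, count in bin_counts:
    bar = '#' * int(round(count * line_k)) or '|'  # '|' marks empty bars
    print('%5.1f: %2d %s' % (bin_val, count, bar))
```

Non-empty bins always get at least one character (the ``or '|'`` fallback), so small counts remain visible next to large ones.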
