1 # forpy
2 
3 Elegant Decision Forests in Python. A fast, clean and clear C++14 implementation
4 with no external dependencies and an easy-to-use interface in Python and C++.
5 
6 ## Principles
7 
8 1. Easy to hack
9  The library provides an object oriented implementation of parts of the
10  theoretical framework
11  of
12  [Criminisi et al.](https://www.microsoft.com/en-us/research/publication/decision-forests-for-classification-regression-density-estimation-manifold-learning-and-semi-supervised-learning/).
13  This makes the library easy to understand and extend while maintaining full
14  generality. Objects are easy to reuse and recombine.
15 2. Easy to compile
16  There are no dependencies to install. The only build requirement is a C++14-compatible
17  compiler (tested with GCC and clang). We use the CMake build system with
18  pre-compiled headers for fast builds and compatibility with many platforms.
19 3. Easy to use
20  The library exposes a Python and a C++ interface that accept numpy/Eigen
21  arrays of many types. The interface follows the widely used scikit-learn
22  convention with `fit` and `predict` methods.
23 4. Fast
24  The implementation, while written for readability and
25  maintainability, is highly optimized for speed (see the benchmarks below). We
26  outperform scikit-learn in all settings by notable margins at fit and predict
27  time. For fitting, we enable fully deterministic parallelization even during
28  node optimization, which makes full use of modern many-core CPUs.
29 5. Efficient
30  We use the [cereal](https://uscilab.github.io/cereal/) C++ serialization
31  library, which enables binary persistence (C++, pickle) and JSON export. The binary
32  models need only a fraction of the space of the corresponding scikit-learn
33  models.
34 
35 ## Approach
36 
37 We use modern C++14 to create highly flexible, highly efficient data structures
38 and algorithm implementations. Core building blocks are:
39 
40 * A highly
41  efficient [`variant` implementation](https://github.com/mapbox/variant)
42  for datatype-dependent storage and processing that keeps the interface
43  and implementation uncluttered.
44  [`glog`](https://github.com/google/glog)
45  and [`gperftools`](https://github.com/gperftools/gperftools) integration is
46  used for easy debugging and optimization. None of these libraries need
47  to be installed: they are part of the repository and completely integrated into
48  the build.
49 
50 * CMake is used as a build system so that many platforms and compilers can be
51  targeted easily. [`cotire`](https://github.com/sakra/cotire) is used for
52  automatic fast pre-compiled header builds. The whole package is automatically
53  `pip` installable even without an installed CMake thanks
54  to [`scikit-build`](https://github.com/scikit-build/scikit-build).
55 
56 * We use standard [`Eigen`](http://eigen.tuxfamily.org) data structures wrapped
57  in `variant` to provide an easy-to-use C++ interface. We create a small and
58  sleek Python interface thanks
59  to [pybind11](https://github.com/pybind/pybind11).
60 
61 * Threading is implemented
62  with [cpptask](https://github.com/Kolkir/cpptask) to be efficient and
63  cross-platform.
64 
65 ## Compilation & Installation
66 
67 If you want to use it from Python, a simple `python setup.py install` should do.
68 If you want to use it from C++, you can rely on CMake and run `mkdir build; cd
69 build; cmake ..; cmake --build . -- -j` for an out-of-source build.
70 
71 ## Development & Contributing
72 
73 ### Formatting and conventions
74 
75 The code is formatted
76 with [clang-format](https://clang.llvm.org/docs/ClangFormat.html) according to
77 the Google C++ style guidelines (a `.clang-format` file is provided at project
78 level). We use abbreviated CamelCase class names. In library-internal
79 functions, assertions must use the `FASSERT` macro so that they can be selectively enabled;
80 they should be disabled in release builds for performance reasons.
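
As a rough illustration of a selectively enabled assertion, here is a minimal sketch. The names `SKETCH_ASSERT`, `SKETCH_ASSERTS_ENABLED` and `safe_ratio` are hypothetical and chosen for this example only; the actual `FASSERT` definition in forpy may look quite different.

```cpp
#include <sstream>
#include <stdexcept>
#include <string>

// Hypothetical stand-in for a switchable assertion macro; this is NOT
// forpy's FASSERT. Define SKETCH_ASSERTS_ENABLED as 0 (for example in
// release builds) to compile every check away.
#ifndef SKETCH_ASSERTS_ENABLED
#define SKETCH_ASSERTS_ENABLED 1
#endif

#if SKETCH_ASSERTS_ENABLED
#define SKETCH_ASSERT(cond)                                          \
  do {                                                               \
    if (!(cond)) {                                                   \
      std::ostringstream msg_;                                       \
      msg_ << "Assertion failed: " #cond " (" << __FILE__ << ":"     \
           << __LINE__ << ")";                                       \
      throw std::runtime_error(msg_.str());                          \
    }                                                                \
  } while (false)
#else
#define SKETCH_ASSERT(cond) \
  do {                      \
  } while (false)
#endif

// Example of a library-internal helper guarded by the macro.
inline float safe_ratio(float numerator, float denominator) {
  SKETCH_ASSERT(denominator != 0.f);
  return numerator / denominator;
}
```

With the checks enabled, `safe_ratio(1.f, 0.f)` throws a `std::runtime_error` whose message names the failed condition and its location, in line with the requirement that error messages be meaningful. With `SKETCH_ASSERTS_ENABLED` set to 0, the check costs nothing at run time.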
81 
82 Error messages *must* be meaningful and provide additional information on what
83 caused the error.
84 
85 Design decisions are made according to the following priorities: correctness &
86 numerical stability >> speed >> in-memory efficiency >> storage efficiency.
87 
88 #### Rough naming conventions
89 
90 Raw pointer variable names end with `_p`, indicating (1) high-performance
91 element access and (2) the need for special care. In performance-relevant loops, only `_p`
92 variables should be used. Variant variable names end with `_v`.
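
The `_p` convention can be illustrated with a small loop. The function `sum_of_squares` and its variables are hypothetical and not taken from the forpy sources; they only follow the suffix rule described above.

```cpp
#include <cstddef>
#include <vector>

// Illustrative only: names are hypothetical and merely follow the `_p`
// suffix convention for raw pointers.
inline float sum_of_squares(const std::vector<float>& values) {
  // Fetch the raw pointer once, outside the hot loop.
  const float* values_p = values.data();
  const std::size_t n = values.size();
  float acc = 0.f;
  for (std::size_t i = 0; i < n; ++i) {
    // Inside the loop, elements are accessed only through the `_p` variable.
    acc += values_p[i] * values_p[i];
  }
  return acc;
}
```

The suffix makes it visible at the point of use that an access is unchecked and that the pointer's lifetime is tied to the container it came from.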
93 
94 ### Programming concepts
95 
96 **Lock-free parallelism:**
97 
98 Memory for the tree and leaf storage structures is allocated pessimistically before entering
99 parallel regions. Since it is linear in the number of
100 samples, this is not too much of an overhead. The pointer to where the next
101 nodes/leaves can be created is a `std::atomic<size_t>`. After training, the
102 data structures are resized to their proper size.
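
The scheme above can be sketched in a few lines of standard C++. This is a simplified illustration, not forpy's code: the `Node` type, the function `grow_nodes` and the capacity bound of `2 * n_samples` are assumptions made for the example.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// Hypothetical node type for illustration; not forpy's actual tree layout.
struct Node {
  std::size_t created_by = 0;  // id of the thread that claimed this slot
};

// Creates `n_threads * nodes_per_thread` nodes in parallel and returns the
// storage trimmed to the number of slots actually used.
inline std::vector<Node> grow_nodes(std::size_t n_samples,
                                    std::size_t n_threads,
                                    std::size_t nodes_per_thread) {
  // Pessimistic allocation before the parallel region. The bound used here
  // (2 * n_samples) is an assumption; it only needs to be linear in the
  // number of samples and large enough for the worst case.
  const std::size_t capacity = 2 * n_samples;
  assert(n_threads * nodes_per_thread <= capacity);
  std::vector<Node> nodes(capacity);
  std::atomic<std::size_t> next_id{0};

  auto worker = [&](std::size_t thread_id) {
    for (std::size_t i = 0; i < nodes_per_thread; ++i) {
      // fetch_add hands every caller a distinct index, so no lock is needed.
      const std::size_t slot = next_id.fetch_add(1);
      nodes[slot].created_by = thread_id;
    }
  };

  std::vector<std::thread> threads;
  for (std::size_t t = 0; t < n_threads; ++t) threads.emplace_back(worker, t);
  for (auto& th : threads) th.join();

  // After the parallel region, shrink to the size that was actually needed.
  nodes.resize(next_id.load());
  return nodes;
}
```

Because the vector is never resized while threads are running and every thread writes only to the slots it claimed, no two threads touch the same element. Calling `grow_nodes(100, 4, 25)` yields exactly 100 nodes, 25 claimed by each thread, although the order in which slots are claimed varies between runs.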
103 
104 **The [Desk](@ref forpydeskGroup) classes:** this is a helper concept to
105 simplify safe parallelism. A 'desk' contains all thread-local variables and
106 pointers to the shared storage. Each thread has its own [desk](@ref forpy::Desk)
107 with sub-objects containing thread-local storage for the corresponding
108 sub-functions. It is constructed with pointers to the memory for storing the
109 results of the training.
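
To make the idea concrete, here is a heavily simplified sketch of a desk-like helper. `SketchDesk`, `process_chunk` and `sum_chunks` are hypothetical names invented for this example; the real `forpy::Desk` has a different and richer layout with sub-objects per sub-function.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// Hypothetical, simplified 'desk': thread-local scratch space plus pointers
// to shared, pre-allocated result storage.
struct SketchDesk {
  std::vector<float> scratch;             // never shared between threads
  float* results_p;                       // shared result storage
  std::atomic<std::size_t>* next_slot_p;  // shared slot counter

  SketchDesk(float* results, std::atomic<std::size_t>* next_slot)
      : results_p(results), next_slot_p(next_slot) {}
};

// Stand-in for a training sub-function: it works in the desk's scratch
// buffer and then publishes one result to the shared storage.
inline void process_chunk(SketchDesk* desk, const float* data_p,
                          std::size_t n) {
  desk->scratch.assign(data_p, data_p + n);
  float sum = 0.f;
  for (float v : desk->scratch) sum += v;
  const std::size_t slot = desk->next_slot_p->fetch_add(1);
  desk->results_p[slot] = sum;
}

// Splits `data` into equal chunks and sums each chunk on its own thread.
inline std::vector<float> sum_chunks(const std::vector<float>& data,
                                     std::size_t n_threads) {
  assert(n_threads > 0 && data.size() % n_threads == 0);
  const std::size_t chunk = data.size() / n_threads;
  std::vector<float> results(n_threads);
  std::atomic<std::size_t> next_slot{0};
  std::vector<std::thread> threads;
  for (std::size_t t = 0; t < n_threads; ++t) {
    threads.emplace_back([&, t] {
      SketchDesk desk(results.data(), &next_slot);  // one desk per thread
      process_chunk(&desk, data.data() + t * chunk, chunk);
    });
  }
  for (auto& th : threads) th.join();
  return results;
}
```

Sub-functions receive only the desk, so everything they may write to is either private to the calling thread or reached through the explicitly shared pointers. Note that in this sketch the order of the entries in the returned vector depends on thread scheduling.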
110 
111 ## License
112 
113 The library itself is available under the 2-clause BSD license. All libraries
114 used are also available under open source licenses, for details see
115 `build_support/external`.