Elegant Decision Forests in Python. Fast, clean and clear C++14 implementation
with no dependencies and with an easy to use interface in Python and C++.

The library provides an object-oriented implementation of parts of the
decision forest framework of
[Criminisi et al.](https://www.microsoft.com/en-us/research/publication/decision-forests-for-classification-regression-density-estimation-manifold-learning-and-semi-supervised-learning/).
This makes the library easy to understand and extend while maintaining full
generality. Objects are easy to reuse and recombine.

There are no dependencies. The only build requirement is a C++14-compatible
compiler (tested with GCC and Clang). We use the CMake build system with
pre-compiled headers for fast builds and compatibility with many platforms.

The library exposes a Python and a C++ interface that accept NumPy/Eigen
arrays of many types. The interfaces follow the widely used scikit-learn
convention with `fit` and `predict` methods.
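
To illustrate the `fit`/`predict` convention, here is a minimal C++ sketch. The
class `ThresholdStump` is hypothetical and not part of the library; it only
shows the shape of an estimator that learns in `fit` and applies the learned
state in `predict`. It uses `std::vector` instead of Eigen arrays to stay
self-contained.

```cpp
#include <cassert>
#include <cstddef>
#include <limits>
#include <vector>

// Hypothetical example class, NOT the library's API: a one-feature decision
// stump with a scikit-learn style fit/predict interface.
class ThresholdStump {
 public:
  // Pick the threshold (among the observed values) that misclassifies the
  // fewest samples. Labels are expected to be 0 or 1.
  ThresholdStump &fit(const std::vector<float> &x, const std::vector<int> &y) {
    assert(x.size() == y.size() && !x.empty());
    std::size_t best_errors = std::numeric_limits<std::size_t>::max();
    for (float candidate : x) {
      std::size_t errors = 0;
      for (std::size_t i = 0; i < x.size(); ++i) {
        const int predicted = x[i] > candidate ? 1 : 0;
        if (predicted != y[i]) ++errors;
      }
      if (errors < best_errors) {
        best_errors = errors;
        threshold_ = candidate;
      }
    }
    fitted_ = true;
    return *this;  // Returning *this allows chaining, as in scikit-learn.
  }

  // Apply the learned threshold to new samples.
  std::vector<int> predict(const std::vector<float> &x) const {
    assert(fitted_);
    std::vector<int> labels;
    labels.reserve(x.size());
    for (float value : x) labels.push_back(value > threshold_ ? 1 : 0);
    return labels;
  }

 private:
  float threshold_ = 0.f;
  bool fitted_ = false;
};
```
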

The implementation of the algorithms, while written with readability and
maintainability in mind, is highly optimized for speed (see the benchmarks
below). We outperform scikit-learn in all settings by notable margins at fit
and predict time. For fitting, we enable fully deterministic parallelization
even during node optimization, which makes it possible to fully leverage
modern CPUs with many cores.
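
One common way to keep a parallel search deterministic is to let every thread
write into its own result slot and to reduce the slots sequentially in a fixed
order afterwards. The sketch below shows this pattern; it is an illustration
under our own assumptions, not necessarily how the library implements node
optimization. `score_feature` and `best_feature` are hypothetical names, and
the score is a stand-in for a real impurity measure.

```cpp
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// Hypothetical stand-in for a split quality measure (lower is better).
inline double score_feature(const std::vector<double> &column) {
  double score = 0.;
  for (double value : column) score += value * value;
  return score;
}

// Score all features in parallel, then select the winner sequentially.
inline std::size_t best_feature(
    const std::vector<std::vector<double>> &columns, std::size_t n_threads) {
  assert(!columns.empty() && n_threads > 0);
  std::vector<double> scores(columns.size());
  std::vector<std::thread> workers;
  for (std::size_t t = 0; t < n_threads; ++t) {
    workers.emplace_back([&, t] {
      // Thread t handles features t, t + n_threads, t + 2 * n_threads, ...
      // Every slot of `scores` has exactly one writer, so no lock is needed.
      for (std::size_t f = t; f < columns.size(); f += n_threads)
        scores[f] = score_feature(columns[f]);
    });
  }
  for (auto &worker : workers) worker.join();
  // Fixed-order reduction: ties always go to the lowest feature index, so
  // the result does not depend on thread scheduling or thread count.
  std::size_t best = 0;
  for (std::size_t f = 1; f < scores.size(); ++f)
    if (scores[f] < scores[best]) best = f;
  return best;
}
```
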

We use the [cereal](https://uscilab.github.io/cereal/) C++ serialization
library, which enables binary persistence (C++, pickle) and JSON export. The
binary models need only a fraction of the space of the corresponding
scikit-learn models.

We use modern C++14 to create highly flexible, highly efficient data structures
and algorithm implementations. Core building blocks are:

* An efficient [`variant` implementation](https://github.com/mapbox/variant)
for high efficiency w.r.t. datatype-dependent storage and processing while
maintaining an uncluttered interface and
implementation. [`glog`](https://github.com/google/glog)
and [`gperftools`](https://github.com/gperftools/gperftools) integration is
used for easy debugging and optimization. None of the dependency libraries
need to be installed; they are part of the repository and completely
integrated into the build.

* CMake is used as a build system so that many platforms and compilers can be
targeted easily. [`cotire`](https://github.com/sakra/cotire) is used for
automatic fast pre-compiled header builds. The whole package is automatically
`pip` installable even without an installed CMake thanks
to [`scikit-build`](https://github.com/scikit-build/scikit-build).

* We use standard [`Eigen`](http://eigen.tuxfamily.org) data structures wrapped
in `variant` to provide an easy-to-use C++ interface. We create a small and
sleek Python interface thanks
to [pybind11](https://github.com/pybind/pybind11).

* Threading is implemented
with [cpptask](https://github.com/Kolkir/cpptask) to be efficient and

## Compilation & Installation

If you want to use it from Python, a simple `python setup.py install` should do.
If you want to use it from C++, you can rely on CMake and run
`mkdir build; cd build; cmake ..; cmake --build . -- -j` for an out-of-source
build.

## Development & Contributing

### Formatting and conventions

The code is formatted
with [clang-format](https://clang.llvm.org/docs/ClangFormat.html) according to
the Google C++ style guidelines (a `.clang-format` file is provided at project
level). We use abbreviated CamelCase class names. Within library-internal
functions, assertions must use the `FASSERT` macro so that they can be enabled
selectively; they should be disabled in release builds for performance reasons.
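
As an illustration of a selectively enabled assertion, here is a minimal
sketch. It is not the definition of `FASSERT`; the macro `SKETCH_ASSERT`, the
switch `SKETCH_ENABLE_ASSERTS` and the function `element_at` are hypothetical
names chosen for this example.

```cpp
#include <sstream>
#include <stdexcept>
#include <string>

// Hypothetical switch. A build system would normally set it, for example
// with -DSKETCH_ENABLE_ASSERTS=0 for release builds.
#ifndef SKETCH_ENABLE_ASSERTS
#define SKETCH_ENABLE_ASSERTS 1
#endif

#if SKETCH_ENABLE_ASSERTS
// Enabled: throw with the failed condition and its source location.
#define SKETCH_ASSERT(condition)                                  \
  do {                                                            \
    if (!(condition)) {                                           \
      std::ostringstream message;                                 \
      message << "Assertion failed: " #condition " (" << __FILE__ \
              << ":" << __LINE__ << ")";                          \
      throw std::runtime_error(message.str());                    \
    }                                                             \
  } while (false)
#else
// Disabled: the condition is not evaluated, so there is no runtime cost.
#define SKETCH_ASSERT(condition) \
  do {                           \
  } while (false)
#endif

// Library-internal helper guarded by the selectively enabled assertion.
inline float element_at(const float *data_p, int size, int index) {
  SKETCH_ASSERT(index >= 0 && index < size);
  return data_p[index];
}
```
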

Error messages *must* be meaningful and provide additional information on what
went wrong.

Design decisions are made according to the following priorities: correctness &
numerical stability >> speed >> in-memory efficiency >> storage efficiency.

#### Rough naming conventions

Raw pointer variable names end with `_p`, indicating (1) high performance
element access and (2) special care. In performance relevant loops, only `_p`
variables should be used. Variant variable names end with `_v`.
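
A short sketch of the `_p` convention in a hot loop follows. The function
`column_sum` and its parameters are made up for this example; the `_v` suffix
is not shown because it would require a variant type.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Sum one column of a row-major matrix stored in a flat vector.
// All names are illustrative and not taken from the library.
inline float column_sum(const std::vector<float> &data, std::size_t n_rows,
                        std::size_t n_cols, std::size_t col) {
  // Validate once, outside the loop.
  assert(col < n_cols && data.size() == n_rows * n_cols);
  // The `_p` suffix marks a raw pointer: fast element access without bounds
  // checks, which is why it needs special care.
  const float *data_p = data.data();
  float sum = 0.f;
  for (std::size_t row = 0; row < n_rows; ++row) {
    sum += data_p[row * n_cols + col];
  }
  return sum;
}
```
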

### Programming concepts

**Lock-free parallelism:**

Memory for the tree and leaf storage structures is allocated pessimistically
before entering parallel regions. Since it is linear in the number of
samples, this is not too much of an overhead. The pointer to where the next
nodes/leaves can be created is a `std::atomic<size_t>`. After training, the
data structures are resized to their proper size.
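
The allocation scheme described above can be sketched as follows. This is a
simplified illustration, not the library's code: `SketchNode` and
`build_nodes` are hypothetical, and the capacity of `2 * n_samples - 1` nodes
is our assumption for a binary tree with at least one sample per leaf.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// Hypothetical node type; `owner` records which thread created the node.
struct SketchNode {
  int owner = -1;
};

inline std::vector<SketchNode> build_nodes(std::size_t n_samples,
                                           std::size_t n_threads,
                                           std::size_t nodes_per_thread) {
  assert(n_samples > 0);
  // Pessimistic, linear-size allocation BEFORE the parallel region
  // (assumed bound, see the text above).
  const std::size_t capacity = 2 * n_samples - 1;
  assert(n_threads * nodes_per_thread <= capacity);
  std::vector<SketchNode> storage(capacity);
  // Index of the next free slot, shared by all threads.
  std::atomic<std::size_t> next_free(0);
  std::vector<std::thread> workers;
  for (std::size_t t = 0; t < n_threads; ++t) {
    workers.emplace_back([&, t] {
      for (std::size_t i = 0; i < nodes_per_thread; ++i) {
        // fetch_add hands out every slot exactly once, without a lock.
        const std::size_t slot = next_free.fetch_add(1);
        storage[slot].owner = static_cast<int>(t);
      }
    });
  }
  for (auto &worker : workers) worker.join();
  // After "training", shrink the storage to the number of used slots.
  storage.resize(next_free.load());
  return storage;
}
```
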

**The [Desk](@ref forpydeskGroup) classes:** this is a helper concept to
simplify safe parallelism. A 'desk' contains all thread-local variables and
pointers to the shared storage. Each thread has its own [desk](@ref forpy::Desk)
with sub-objects containing thread-local storage for the corresponding
sub-functions. It is constructed with pointers to the memory for storing the
results of the training.

## License

The library itself is available under the 2-clause BSD license. All libraries
used are also available under open source licenses; for details see
`build_support/external`.