Age | Commit message (Collapse) | Author |
|
- use ALLOW_BOOL for -list arg instead of ALLOW_ANY
- touch up `-asymptote=<n1,n2,n3...>` help
- pack Args struct a bit more efficiently
- handle args in alphabetical order
|
|
When it is not easily possible to stabilize benchmark machine and code
the argument -min_time can be used to specify a minimum duration
that a benchmark should take. E.g. choose -min_time=1000 if you
are willing to wait about 1 second for each benchmark result.
The default is now set to 10ms instead of 0, which should make runs on
fast machines more stable with negligible slowdown.
|
|
|
|
|
|
This replaces the current benchmarking framework with nanobench [1], an
MIT licensed single-header benchmarking library, of which I am the
autor. This has in my opinion several advantages, especially on Linux:
* fast: Running all benchmarks takes ~6 seconds instead of 4m13s on
an Intel i7-8700 CPU @ 3.20GHz.
* accurate: I ran e.g. the benchmark for SipHash_32b 10 times and
calculate standard deviation / mean = coefficient of variation:
* 0.57% CV for old benchmarking framework
* 0.20% CV for nanobench
So the benchmark results with nanobench seem to vary less than with
the old framework.
* It automatically determines runtime based on clock precision, no need
to specify number of evaluations.
* measure instructions, cycles, branches, instructions per cycle,
branch misses (only Linux, when performance counters are available)
* output in markdown table format.
* Warn about unstable environment (frequency scaling, turbo, ...)
* For better profiling, it is possible to set the environment variable
NANOBENCH_ENDLESS to force endless running of a particular benchmark
without the need to recompile. This makes it to e.g. run "perf top"
and look at hotspots.
Here is an example copy & pasted from the terminal output:
| ns/byte | byte/s | err% | ins/byte | cyc/byte | IPC | bra/byte | miss% | total | benchmark
|--------------------:|--------------------:|--------:|----------------:|----------------:|-------:|---------------:|--------:|----------:|:----------
| 2.52 | 396,529,415.94 | 0.6% | 25.42 | 8.02 | 3.169 | 0.06 | 0.0% | 0.03 | `bench/crypto_hash.cpp RIPEMD160`
| 1.87 | 535,161,444.83 | 0.3% | 21.36 | 5.95 | 3.589 | 0.06 | 0.0% | 0.02 | `bench/crypto_hash.cpp SHA1`
| 3.22 | 310,344,174.79 | 1.1% | 36.80 | 10.22 | 3.601 | 0.09 | 0.0% | 0.04 | `bench/crypto_hash.cpp SHA256`
| 2.01 | 496,375,796.23 | 0.0% | 18.72 | 6.43 | 2.911 | 0.01 | 1.0% | 0.00 | `bench/crypto_hash.cpp SHA256D64_1024`
| 7.23 | 138,263,519.35 | 0.1% | 82.66 | 23.11 | 3.577 | 1.63 | 0.1% | 0.00 | `bench/crypto_hash.cpp SHA256_32b`
| 3.04 | 328,780,166.40 | 0.3% | 35.82 | 9.69 | 3.696 | 0.03 | 0.0% | 0.03 | `bench/crypto_hash.cpp SHA512`
[1] https://github.com/martinus/nanobench
* Adds support for asymptotes
This adds support to calculate asymptotic complexity of a benchmark.
This is similar to #17375, but currently only one asymptote is
supported, and I have added support in the benchmark `ComplexMemPool`
as an example.
Usage is e.g. like this:
```
./bench_bitcoin -filter=ComplexMemPool -asymptote=25,50,100,200,400,600,800
```
This runs the benchmark `ComplexMemPool` several times but with
different complexityN settings. The benchmark can extract that number
and use it accordingly. Here, it's used for `childTxs`. The output is
this:
| complexityN | ns/op | op/s | err% | ins/op | cyc/op | IPC | total | benchmark
|------------:|--------------------:|--------------------:|--------:|----------------:|----------------:|-------:|----------:|:----------
| 25 | 1,064,241.00 | 939.64 | 1.4% | 3,960,279.00 | 2,829,708.00 | 1.400 | 0.01 | `ComplexMemPool`
| 50 | 1,579,530.00 | 633.10 | 1.0% | 6,231,810.00 | 4,412,674.00 | 1.412 | 0.02 | `ComplexMemPool`
| 100 | 4,022,774.00 | 248.58 | 0.6% | 16,544,406.00 | 11,889,535.00 | 1.392 | 0.04 | `ComplexMemPool`
| 200 | 15,390,986.00 | 64.97 | 0.2% | 63,904,254.00 | 47,731,705.00 | 1.339 | 0.17 | `ComplexMemPool`
| 400 | 69,394,711.00 | 14.41 | 0.1% | 272,602,461.00 | 219,014,691.00 | 1.245 | 0.76 | `ComplexMemPool`
| 600 | 168,977,165.00 | 5.92 | 0.1% | 639,108,082.00 | 535,316,887.00 | 1.194 | 1.86 | `ComplexMemPool`
| 800 | 310,109,077.00 | 3.22 | 0.1% |1,149,134,246.00 | 984,620,812.00 | 1.167 | 3.41 | `ComplexMemPool`
| coefficient | err% | complexity
|--------------:|-------:|------------
| 4.78486e-07 | 4.5% | O(n^2)
| 6.38557e-10 | 21.7% | O(n^3)
| 3.42338e-05 | 38.0% | O(n log n)
| 0.000313914 | 46.9% | O(n)
| 0.0129823 | 114.4% | O(log n)
| 0.0815055 | 133.8% | O(1)
The best fitting curve is O(n^2), so the algorithm seems to scale
quadratic with `childTxs` in the range 25 to 800.
|
|
|
|
-BEGIN VERIFY SCRIPT-
# Mark all lines with #includes
sed -i --regexp-extended -e 's/(#include <.*>)/\1 /g' $(git grep -l '#include' ./src/bench/ ./src/test ./src/wallet/test/)
# Sort all marked lines
git diff -U0 | ./contrib/devtools/clang-format-diff.py -p1 -i -v
-END VERIFY SCRIPT-
|
|
-BEGIN VERIFY SCRIPT-
./contrib/devtools/copyright_header.py update ./
-END VERIFY SCRIPT-
|
|
faa92a2297b4a6aebdd58d1818c428f1c0346078 rpc: Remove mempool global from miner (MarcoFalke)
6666ef13f167cfe880c2e94c09d003594d010cf3 test: Properly document blockinfo size in miner_tests (MarcoFalke)
Pull request description:
The miner needs read-only access to the mempool. Instead of using the mutable global `::mempool`, keep a immutable reference to a mempool that is passed to the miner. Apart from the obvious benefits of removing a global and making things immutable, this might also simplify testing with multiple mempools.
ACKs for top commit:
promag:
ACK faa92a2297b4a6aebdd58d1818c428f1c0346078.
fjahr:
ACK faa92a2297b4a6aebdd58d1818c428f1c0346078
jnewbery:
Code review ACK faa92a2297b4a6aebdd58d1818c428f1c0346078
Tree-SHA512: c44027b5d2217a724791166f3f3112c45110ac1dbb37bdae27148a0657e0d1a1d043b0d24e49fd45465ec014224d1b7eb15c92a33069ad883fa8ffeadc24735b
|
|
-BEGIN VERIFY SCRIPT-
./contrib/devtools/copyright_header.py update ./
-END VERIFY SCRIPT-
|
|
|
|
|
|
|
|
This trivial change adds the "override" keyword to some methods of
subclasses meant to override interface methods. This ensures that any
future change to the interface' method signatures which are not correctly
mirrored in the subclass will break at compile time with a clear error message,
rather than fail at runtime (which is harder to debug).
|
|
|
|
* inline performance critical code
* Average runtime is specified and used to calculate iterations.
* Console: show median of multiple runs
* plot: show box plot
* filter benchmarks
* specify scaling factor
* ignore src/test and src/bench in command line check script
* number of iterations instead of time
* Replaced runtime in BENCHMARK makro number of iterations.
* Added -? to bench_bitcoin
* Benchmark plotly.js URL, width, height can be customized
* Fixed incorrect precision warning
|
|
constructor
lastCycles was introduced in 35328187463a7078b4206e394c21d5515929c7de which was merged into master yesterday.
Also initialize beginCycles to zero for consistency and completeness.
|
|
|
|
|
|
std::chrono removes portability issues.
Rather than storing doubles, store the untouched time_points. Then
convert to nanoseconds for display. This allows for maximum precision, while
keeping results comparable between differing hardware/operating systems.
Also, display full nanosecond counts rather than sub-second floats.
|
|
We were saving a div by caching the inverse as a float, but this
ended up requiring a int -> float -> int conversion, which takes
almost as much time as the difference between float mul and div.
There are lots of other more pressing issues with the bench
framework which probably require simply removing the adaptive
iteration count stuff anyway.
|
|
|
|
|
|
The initialization order of global data structures in different
implementation units is undefined. Making use of this is essentially
gambling on what the linker does, the so-called [Static initialization
order fiasco](https://isocpp.org/wiki/faq/ctors#static-init-order).
In this case it apparently worked on Linux but failed on OpenBSD and
FreeBSD.
To create it on first use, make the registration structure local to
a function.
Fixes #8910.
|
|
Edited via:
$ contrib/devtools/copyright_header.py update .
|
|
This adds cycle min/max/avg to the statistics.
Supported on x86 and x86_64 (natively through rdtsc), as well as Linux
(perf syscall).
|
|
Previously the benchmark code used an integer division (%) with
a non-constant in the inner-loop. This is quite slow on many
processors, especially ones like ARM that lack a hardware divide.
Even on fairly recent x86_64 like haswell an integer division can
take something like 100 cycles-- making it comparable to the
runtime of siphash.
This change avoids the division by using bitmasking instead. This
was especially easy since the count was only increased by doubling.
This change also restarts the timing when the execution time was
very low this avoids mintimes of zero in cases where one execution
ends up below the timer resolution. It also reduces the impact of
the overhead on the final result.
The formatting of the prints is changed to not use scientific
notation make it more machine readable (in particular, gnuplot
croaks on the non-fixedpoint, and it doesn't sort correctly).
This also hoists out all the floating point divisions out of the
semi-hot path because it was easy to do so.
It might be prudent to break out the critical test into a macro
just to guarantee that it gets inlined. It might also make sense
to just save out the intermediate counts and times and get the
floating point completely out of the timing loop (because e.g.
on hardware without a fast hardware FPU like some ARM it will
still be slow enough to distort the results). I haven't done
either of these in this commit.
|
|
- ensure header namespaces and end comments are correct
- add missing header end comments
- ensure minimal formatting (add newlines etc.)
|
|
Avoid calling gettimeofday every time through the benchmarking loop, by keeping
track of how long each loop takes and doubling the number of iterations done
between time checks when they take less than 1/16'th of the total elapsed time.
|
|
Benchmarking framework, loosely based on google's micro-benchmarking
library (https://github.com/google/benchmark)
Wny not use the Google Benchmark framework? Because adding Even More Dependencies
isn't worth it. If we get a dozen or three benchmarks and need nanosecond-accurate
timings of threaded code then switching to the full-blown Google Benchmark library
should be considered.
The benchmark framework is hard-coded to run each benchmark for one wall-clock second,
and then spits out .csv-format timing information to stdout. It is left as an
exercise for later (or maybe never) to add command-line arguments to specify which
benchmark(s) to run, how long to run them for, how to format results, etc etc etc.
Again, see the Google Benchmark framework for where that might end up.
See src/bench/MilliSleep.cpp for a sanity-test benchmark that just benchmarks
'sleep 100 milliseconds.'
To compile and run benchmarks:
cd src; make bench
Sample output:
Benchmark,count,min,max,average
Sleep100ms,10,0.101854,0.105059,0.103881
|