-
Notifications
You must be signed in to change notification settings - Fork 1.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
<algorithm>
: vectorize min_element
/ max_element
using SSE4.1/AVX2 for integers
#2438
Labels
Comments
AlexGuteniev
changed the title
Dec 23, 2021
<algorithm>
: vectorize min_element
/ max_element
/using SSE4.1/AVX2 for integers<algorithm>
: vectorize min_element
/ max_element
using SSE4.1/AVX2 for integers
AlexGuteniev
added a commit
to AlexGuteniev/STL
that referenced
this issue
Dec 26, 2021
Resolves microsoft#2438 TODO: * Test coverage * Attach minmax_element * Add AVX2 version of the same ---- <detail> <summary><b>Benchmark</b></summary> ```C++ #include <algorithm> #include <cstdint> #include <chrono> #include <iostream> #include <ranges> #include <intrin.h> enum class Kind { Min, Max, }; template<typename T> void benchmark_find(T* a, std::size_t max, size_t start, size_t pos, Kind kind, size_t rep) { std::fill_n(a, max, '0'); if (pos < max && pos >= start) { if (kind == Kind::Min) { a[pos] = '*'; } else { a[pos] = '1'; } } auto t1 = std::chrono::steady_clock::now(); switch (kind) { case Kind::Min: for (std::size_t s = 0; s < rep; s++) { if (std::min_element(a + start, a + max) != a + pos) { abort(); } } break; case Kind::Max: for (std::size_t s = 0; s < rep; s++) { if (std::min_element(a + start, a + max) != a + pos) { abort(); } } break; } auto t2 = std::chrono::steady_clock::now(); const char* op_str = nullptr; switch (kind) { case Kind::Min: op_str = "min"; break; case Kind::Max: op_str = "max"; break; } std::cout << std::setw(10) << std::chrono::duration_cast<std::chrono::duration<double>>(t2 - t1).count() << "s -- " << "Op " << op_str << " Size " << sizeof(T) << " byte elements, array size " << max << " starting at " << start << " found at " << pos << "; " << rep << " repetitions \n"; } constexpr std::size_t Nmax = 8192; alignas(64) std::uint8_t a8[Nmax]; alignas(64) std::uint16_t a16[Nmax]; alignas(64) std::uint32_t a32[Nmax]; alignas(64) std::uint64_t a64[Nmax]; extern "C" long __isa_enabled; int main() { std::cout << "Vector alg used: " << _USE_STD_VECTOR_ALGORITHMS << "\n"; benchmark_find(a8, Nmax, 0, 3459, Kind::Min, 100000); benchmark_find(a16, Nmax, 0, 3459, Kind::Min, 100000); benchmark_find(a32, Nmax, 0, 3459, Kind::Min, 100000); benchmark_find(a64, Nmax, 0, 3459, Kind::Min, 100000); benchmark_find(a8, Nmax, 0, 3459, Kind::Max, 100000); benchmark_find(a16, Nmax, 0, 3459, Kind::Max, 100000); benchmark_find(a32, Nmax, 0, 3459, Kind::Max, 100000); benchmark_find(a64, Nmax, 0, 3459, Kind::Max, 100000); std::cout << "Done\n"; return 0; } ``` <detail> <summary><b>Current benchmark results</b></summary> ``` ********************************************************************** ** Visual Studio 2022 Developer Command Prompt v17.1.0-pre.1.1 ** Copyright (c) 2021 Microsoft Corporation ********************************************************************** [vcvarsall.bat] Environment initialized for: 'x64' C:\Program Files\Microsoft Visual Studio\2022\Preview>cd/d C:\Project\vector_find_benchmark C:\Project\vector_find_benchmark>set INCLUDE=C:\Project\STL\out\build\x64\out\inc;%INCLUDE% C:\Project\vector_find_benchmark>set LIB=C:\Project\STL\out\build\x64\out\lib\amd64;%LIB% C:\Project\vector_find_benchmark>set PATH=C:\Project\STL\out\build\x64\out\bin\amd64;%PATH% C:\Project\vector_find_benchmark>cl /O2 /std:c++latest /EHsc /D_USE_STD_VECTOR_ALGORITHMS=0 /nologo vector_find_benchmark.cpp vector_find_benchmark.cpp vector_find_benchmark.cpp(1): warning C4005: '_USE_STD_VECTOR_ALGORITHMS': macro redefinition vector_find_benchmark.cpp: note: see previous definition of '_USE_STD_VECTOR_ALGORITHMS' C:\Project\vector_find_benchmark>cl /O2 /std:c++latest /EHsc /D_USE_STD_VECTOR_ALGORITHMS=0 /nologo vector_find_benchmark.cpp vector_find_benchmark.cpp C:\Project\vector_find_benchmark>vector_find_benchmark.exe Vector alg used: 0 1.48497s -- Op min Size 1 byte elements, array size 8192 starting at 0 found at 3459; 100000 repetitions 1.48125s -- Op min Size 2 byte elements, array size 8192 starting at 0 found at 3459; 100000 repetitions 1.47988s -- Op min Size 4 byte elements, array size 8192 starting at 0 found at 3459; 100000 repetitions 1.48431s -- Op min Size 8 byte elements, array size 8192 starting at 0 found at 3459; 100000 repetitions C:\Project\vector_find_benchmark>cl /O2 /std:c++latest /EHsc /D_USE_STD_VECTOR_ALGORITHMS=1 /nologo vector_find_benchmark.cpp vector_find_benchmark.cpp C:\Project\vector_find_benchmark>vector_find_benchmark.exe Vector alg used: 1 0.0559598s -- Op min Size 1 byte elements, array size 8192 starting at 0 found at 3459; 100000 repetitions 0.0681002s -- Op min Size 2 byte elements, array size 8192 starting at 0 found at 3459; 100000 repetitions 0.159074s -- Op min Size 4 byte elements, array size 8192 starting at 0 found at 3459; 100000 repetitions 0.597614s -- Op min Size 8 byte elements, array size 8192 starting at 0 found at 3459; 100000 repetitions ``` </detail>
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
With contiguous memory range and the default predicate (a flavor of
less
) or no predicate it can be vectorized.SSE4.1
max_element
algorithm.Vertical part:
_mm_cmpgt_epi
to compute mask of greater value of current against previous_mm_add_epi
to count indices of the current value (add precomputed vector_mm_set1_epi
with 1)_mm_blendv_epi8
to select indices of greater value into_Cur_idx_max
_mm_max_epi
to update previous maximum, or alternatively can use_mm_blendv_epi8
again (for 64-bit where there's no max)_Cur_idx_max
does not overflowHorizontal part:
_mm_max_epi
_mm_cmpeq_epi
with maximum_mm_blendv_epi8
on_Cur_idx_max
to get vector of indices of greater value or all FFs for not the greatest indices_mm_min_epu
_BitScanForward
into_H_pos
_Cur_idx_max
into_V_pos
_V_pos*16 + _Result_index
relative to the starting pointer in bytes_Result_positions
overflow, start again with previous maximum already initialized(For 64-bit vector it takes
_mm_cmpgt_epi64
from SSE4.2 actually, but we don't distinguish SSE4.1 and SSE4.2 anyway)It all complex due to that algorithms return iterator, not value. Still the vectorization should provide perf improvement.
The text was updated successfully, but these errors were encountered: