`<algorithm>`: vectorize `min_element` / `max_element` using SSE4.1/AVX2 for integers #2438

AlexGuteniev · 2021-12-20T12:58:58Z

With contiguous memory range and the default predicate (a flavor of less) or no predicate it can be vectorized.
SSE4.1 max_element algorithm.

Vertical part:

_mm_cmpgt_epi to compute mask of greater value of current against previous
_mm_add_epi to count indices of the current value (add precomputed vector _mm_set1_epi with 1)
_mm_blendv_epi8 to select indices of greater value into _Cur_idx_max
_mm_max_epi to update previous maximum, or alternatively can use _mm_blendv_epi8 again (for 64-bit where there's no max)
loop until the end or until some number to make sure accumulated _Cur_idx_max does not overflow

Horizontal part:

out of the vector, select greatest value by shuffling and comparing with itself with _mm_max_epi
find mask of greatest value by _mm_cmpeq_epi with maximum
use mask of greatest value with _mm_blendv_epi8 on _Cur_idx_max to get vector of indices of greater value or all FFs for not the greatest indices
out of indices vector find smallest indices by shuffling and comparing with itself with _mm_min_epu
find smallest indices mask. Bitwise-and this mask with the previous indices mask to handle the case if all FFs is actually the index.
Extract the mask to GPR and _BitScanForward into _H_pos
knowing index of greater element in vector, extract its index _Cur_idx_max into _V_pos
the answer is _V_pos*16 + _Result_index relative to the starting pointer in bytes
if loop stopped to prevent _Result_positions overflow, start again with previous maximum already initialized

(For 64-bit vector it takes _mm_cmpgt_epi64 from SSE4.2 actually, but we don't distinguish SSE4.1 and SSE4.2 anyway)

It all complex due to that algorithms return iterator, not value. Still the vectorization should provide perf improvement.

The text was updated successfully, but these errors were encountered:

Resolves microsoft#2438 TODO: * Test coverage * Attach minmax_element * Add AVX2 version of the same ---- <detail> <summary><b>Benchmark</b></summary> ```C++ #include <algorithm> #include <cstdint> #include <chrono> #include <iostream> #include <ranges> #include <intrin.h> enum class Kind { Min, Max, }; template<typename T> void benchmark_find(T* a, std::size_t max, size_t start, size_t pos, Kind kind, size_t rep) { std::fill_n(a, max, '0'); if (pos < max && pos >= start) { if (kind == Kind::Min) { a[pos] = '*'; } else { a[pos] = '1'; } } auto t1 = std::chrono::steady_clock::now(); switch (kind) { case Kind::Min: for (std::size_t s = 0; s < rep; s++) { if (std::min_element(a + start, a + max) != a + pos) { abort(); } } break; case Kind::Max: for (std::size_t s = 0; s < rep; s++) { if (std::min_element(a + start, a + max) != a + pos) { abort(); } } break; } auto t2 = std::chrono::steady_clock::now(); const char* op_str = nullptr; switch (kind) { case Kind::Min: op_str = "min"; break; case Kind::Max: op_str = "max"; break; } std::cout << std::setw(10) << std::chrono::duration_cast<std::chrono::duration<double>>(t2 - t1).count() << "s -- " << "Op " << op_str << " Size " << sizeof(T) << " byte elements, array size " << max << " starting at " << start << " found at " << pos << "; " << rep << " repetitions \n"; } constexpr std::size_t Nmax = 8192; alignas(64) std::uint8_t a8[Nmax]; alignas(64) std::uint16_t a16[Nmax]; alignas(64) std::uint32_t a32[Nmax]; alignas(64) std::uint64_t a64[Nmax]; extern "C" long __isa_enabled; int main() { std::cout << "Vector alg used: " << _USE_STD_VECTOR_ALGORITHMS << "\n"; benchmark_find(a8, Nmax, 0, 3459, Kind::Min, 100000); benchmark_find(a16, Nmax, 0, 3459, Kind::Min, 100000); benchmark_find(a32, Nmax, 0, 3459, Kind::Min, 100000); benchmark_find(a64, Nmax, 0, 3459, Kind::Min, 100000); benchmark_find(a8, Nmax, 0, 3459, Kind::Max, 100000); benchmark_find(a16, Nmax, 0, 3459, Kind::Max, 100000); benchmark_find(a32, Nmax, 0, 3459, Kind::Max, 100000); benchmark_find(a64, Nmax, 0, 3459, Kind::Max, 100000); std::cout << "Done\n"; return 0; } ``` <detail> <summary><b>Current benchmark results</b></summary> ``` ********************************************************************** ** Visual Studio 2022 Developer Command Prompt v17.1.0-pre.1.1 ** Copyright (c) 2021 Microsoft Corporation ********************************************************************** [vcvarsall.bat] Environment initialized for: 'x64' C:\Program Files\Microsoft Visual Studio\2022\Preview>cd/d C:\Project\vector_find_benchmark C:\Project\vector_find_benchmark>set INCLUDE=C:\Project\STL\out\build\x64\out\inc;%INCLUDE% C:\Project\vector_find_benchmark>set LIB=C:\Project\STL\out\build\x64\out\lib\amd64;%LIB% C:\Project\vector_find_benchmark>set PATH=C:\Project\STL\out\build\x64\out\bin\amd64;%PATH% C:\Project\vector_find_benchmark>cl /O2 /std:c++latest /EHsc /D_USE_STD_VECTOR_ALGORITHMS=0 /nologo vector_find_benchmark.cpp vector_find_benchmark.cpp vector_find_benchmark.cpp(1): warning C4005: '_USE_STD_VECTOR_ALGORITHMS': macro redefinition vector_find_benchmark.cpp: note: see previous definition of '_USE_STD_VECTOR_ALGORITHMS' C:\Project\vector_find_benchmark>cl /O2 /std:c++latest /EHsc /D_USE_STD_VECTOR_ALGORITHMS=0 /nologo vector_find_benchmark.cpp vector_find_benchmark.cpp C:\Project\vector_find_benchmark>vector_find_benchmark.exe Vector alg used: 0 1.48497s -- Op min Size 1 byte elements, array size 8192 starting at 0 found at 3459; 100000 repetitions 1.48125s -- Op min Size 2 byte elements, array size 8192 starting at 0 found at 3459; 100000 repetitions 1.47988s -- Op min Size 4 byte elements, array size 8192 starting at 0 found at 3459; 100000 repetitions 1.48431s -- Op min Size 8 byte elements, array size 8192 starting at 0 found at 3459; 100000 repetitions C:\Project\vector_find_benchmark>cl /O2 /std:c++latest /EHsc /D_USE_STD_VECTOR_ALGORITHMS=1 /nologo vector_find_benchmark.cpp vector_find_benchmark.cpp C:\Project\vector_find_benchmark>vector_find_benchmark.exe Vector alg used: 1 0.0559598s -- Op min Size 1 byte elements, array size 8192 starting at 0 found at 3459; 100000 repetitions 0.0681002s -- Op min Size 2 byte elements, array size 8192 starting at 0 found at 3459; 100000 repetitions 0.159074s -- Op min Size 4 byte elements, array size 8192 starting at 0 found at 3459; 100000 repetitions 0.597614s -- Op min Size 8 byte elements, array size 8192 starting at 0 found at 3459; 100000 repetitions ``` </detail>

AlexGuteniev mentioned this issue Dec 20, 2021

<algorithm>: vectorize min_element / max_element using SSE4.1/AVX for floats #2439

Closed

AlexGuteniev changed the title ~~<algorithm>: vectorize min_element / max_element /using SSE4.1/AVX2 for integers~~ <algorithm>: vectorize min_element / max_element using SSE4.1/AVX2 for integers Dec 23, 2021

AlexGuteniev mentioned this issue Dec 26, 2021

Implement vectorized min_ / max_element for ints #2447

Merged

StephanTLavavej added the performance Must go faster label Jan 12, 2022

StephanTLavavej closed this as completed in #2447 Jun 19, 2022

StephanTLavavej added the fixed Something works now, yay! label Jun 19, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

`<algorithm>`: vectorize `min_element` / `max_element` using SSE4.1/AVX2 for integers #2438

`<algorithm>`: vectorize `min_element` / `max_element` using SSE4.1/AVX2 for integers #2438

AlexGuteniev commented Dec 20, 2021 •

edited

Loading

<algorithm>: vectorize min_element / max_element using SSE4.1/AVX2 for integers #2438

<algorithm>: vectorize min_element / max_element using SSE4.1/AVX2 for integers #2438

Comments

AlexGuteniev commented Dec 20, 2021 • edited Loading

`<algorithm>`: vectorize `min_element` / `max_element` using SSE4.1/AVX2 for integers #2438

`<algorithm>`: vectorize `min_element` / `max_element` using SSE4.1/AVX2 for integers #2438

AlexGuteniev commented Dec 20, 2021 •

edited

Loading