Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

<algorithm>: vectorize min_element / max_element using SSE4.1/AVX2 for integers #2438

Closed
AlexGuteniev opened this issue Dec 20, 2021 · 0 comments · Fixed by #2447
Closed
Labels
fixed Something works now, yay! performance Must go faster

Comments

@AlexGuteniev
Copy link
Contributor

AlexGuteniev commented Dec 20, 2021

With contiguous memory range and the default predicate (a flavor of less) or no predicate it can be vectorized.
SSE4.1 max_element algorithm.

Vertical part:

  • _mm_cmpgt_epi to compute mask of greater value of current against previous
  • _mm_add_epi to count indices of the current value (add precomputed vector _mm_set1_epi with 1)
  • _mm_blendv_epi8 to select indices of greater value into _Cur_idx_max
  • _mm_max_epi to update previous maximum, or alternatively can use _mm_blendv_epi8 again (for 64-bit where there's no max)
  • loop until the end or until some number to make sure accumulated _Cur_idx_max does not overflow

Horizontal part:

  • out of the vector, select greatest value by shuffling and comparing with itself with _mm_max_epi
  • find mask of greatest value by _mm_cmpeq_epi with maximum
  • use mask of greatest value with _mm_blendv_epi8 on _Cur_idx_max to get vector of indices of greater value or all FFs for not the greatest indices
  • out of indices vector find smallest indices by shuffling and comparing with itself with _mm_min_epu
  • find smallest indices mask. Bitwise-and this mask with the previous indices mask to handle the case if all FFs is actually the index.
  • Extract the mask to GPR and _BitScanForward into _H_pos
  • knowing index of greater element in vector, extract its index _Cur_idx_max into _V_pos
  • the answer is _V_pos*16 + _Result_index relative to the starting pointer in bytes
  • if loop stopped to prevent _Result_positions overflow, start again with previous maximum already initialized

(For 64-bit vector it takes _mm_cmpgt_epi64 from SSE4.2 actually, but we don't distinguish SSE4.1 and SSE4.2 anyway)

It all complex due to that algorithms return iterator, not value. Still the vectorization should provide perf improvement.

@AlexGuteniev AlexGuteniev changed the title <algorithm>: vectorize min_element / max_element /using SSE4.1/AVX2 for integers <algorithm>: vectorize min_element / max_element using SSE4.1/AVX2 for integers Dec 23, 2021
AlexGuteniev added a commit to AlexGuteniev/STL that referenced this issue Dec 26, 2021
Resolves microsoft#2438

TODO:
 * Test coverage
 * Attach minmax_element
 * Add AVX2 version of the same

----
<detail>
<summary><b>Benchmark</b></summary>

```C++
#include <algorithm>
#include <cstdint>
#include <chrono>
#include <iostream>
#include <ranges>
#include <intrin.h>

enum class Kind {
    Min,
    Max,
};

template<typename T>
void benchmark_find(T* a, std::size_t max, size_t start, size_t pos, Kind kind, size_t rep) {
    std::fill_n(a, max, '0');
    if (pos < max && pos >= start) {
        if (kind == Kind::Min) {
            a[pos] = '*';
        }
        else {
            a[pos] = '1';
        }
    }

    auto t1 = std::chrono::steady_clock::now();

    switch (kind)
    {
    case Kind::Min:
        for (std::size_t s = 0; s < rep; s++) {
            if (std::min_element(a + start, a + max) != a + pos) {
                abort();
            }
        }
        break;
    case Kind::Max:
        for (std::size_t s = 0; s < rep; s++) {
            if (std::min_element(a + start, a + max) != a + pos) {
                abort();
            }
        }
        break;
    }

    auto t2 = std::chrono::steady_clock::now();

    const char* op_str = nullptr;
    switch (kind)
    {
    case Kind::Min:
        op_str = "min";
        break;
    case Kind::Max:
        op_str = "max";
        break;
    }
    std::cout << std::setw(10) << std::chrono::duration_cast<std::chrono::duration<double>>(t2 - t1).count() << "s  -- "
        << "Op " << op_str << " Size " << sizeof(T) << " byte elements, array size " << max
        << " starting at " << start << " found at " << pos << "; " << rep << " repetitions \n";
}

constexpr std::size_t Nmax = 8192;

alignas(64) std::uint8_t    a8[Nmax];
alignas(64) std::uint16_t   a16[Nmax];
alignas(64) std::uint32_t   a32[Nmax];
alignas(64) std::uint64_t   a64[Nmax];

extern "C" long __isa_enabled;

int main()
{
    std::cout << "Vector alg used: " << _USE_STD_VECTOR_ALGORITHMS << "\n";

    benchmark_find(a8, Nmax, 0, 3459, Kind::Min, 100000);
    benchmark_find(a16, Nmax, 0, 3459, Kind::Min, 100000);
    benchmark_find(a32, Nmax, 0, 3459, Kind::Min, 100000);
    benchmark_find(a64, Nmax, 0, 3459, Kind::Min, 100000);

    benchmark_find(a8, Nmax, 0, 3459, Kind::Max, 100000);
    benchmark_find(a16, Nmax, 0, 3459, Kind::Max, 100000);
    benchmark_find(a32, Nmax, 0, 3459, Kind::Max, 100000);
    benchmark_find(a64, Nmax, 0, 3459, Kind::Max, 100000);

    std::cout << "Done\n";

    return 0;
}
```

<detail>
<summary><b>Current benchmark results</b></summary>

```
**********************************************************************
** Visual Studio 2022 Developer Command Prompt v17.1.0-pre.1.1
** Copyright (c) 2021 Microsoft Corporation
**********************************************************************
[vcvarsall.bat] Environment initialized for: 'x64'

C:\Program Files\Microsoft Visual Studio\2022\Preview>cd/d C:\Project\vector_find_benchmark

C:\Project\vector_find_benchmark>set INCLUDE=C:\Project\STL\out\build\x64\out\inc;%INCLUDE%

C:\Project\vector_find_benchmark>set LIB=C:\Project\STL\out\build\x64\out\lib\amd64;%LIB%

C:\Project\vector_find_benchmark>set PATH=C:\Project\STL\out\build\x64\out\bin\amd64;%PATH%

C:\Project\vector_find_benchmark>cl /O2 /std:c++latest /EHsc /D_USE_STD_VECTOR_ALGORITHMS=0 /nologo vector_find_benchmark.cpp
vector_find_benchmark.cpp
vector_find_benchmark.cpp(1): warning C4005: '_USE_STD_VECTOR_ALGORITHMS': macro redefinition
vector_find_benchmark.cpp: note: see previous definition of '_USE_STD_VECTOR_ALGORITHMS'

C:\Project\vector_find_benchmark>cl /O2 /std:c++latest /EHsc /D_USE_STD_VECTOR_ALGORITHMS=0 /nologo vector_find_benchmark.cpp
vector_find_benchmark.cpp

C:\Project\vector_find_benchmark>vector_find_benchmark.exe
Vector alg used: 0
   1.48497s  -- Op min Size 1 byte elements, array size 8192 starting at 0 found at 3459; 100000 repetitions
   1.48125s  -- Op min Size 2 byte elements, array size 8192 starting at 0 found at 3459; 100000 repetitions
   1.47988s  -- Op min Size 4 byte elements, array size 8192 starting at 0 found at 3459; 100000 repetitions
   1.48431s  -- Op min Size 8 byte elements, array size 8192 starting at 0 found at 3459; 100000 repetitions

C:\Project\vector_find_benchmark>cl /O2 /std:c++latest /EHsc /D_USE_STD_VECTOR_ALGORITHMS=1 /nologo vector_find_benchmark.cpp
vector_find_benchmark.cpp

C:\Project\vector_find_benchmark>vector_find_benchmark.exe
Vector alg used: 1
 0.0559598s  -- Op min Size 1 byte elements, array size 8192 starting at 0 found at 3459; 100000 repetitions
 0.0681002s  -- Op min Size 2 byte elements, array size 8192 starting at 0 found at 3459; 100000 repetitions
  0.159074s  -- Op min Size 4 byte elements, array size 8192 starting at 0 found at 3459; 100000 repetitions
  0.597614s  -- Op min Size 8 byte elements, array size 8192 starting at 0 found at 3459; 100000 repetitions
```
</detail>
@StephanTLavavej StephanTLavavej added the performance Must go faster label Jan 12, 2022
@StephanTLavavej StephanTLavavej added the fixed Something works now, yay! label Jun 19, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
fixed Something works now, yay! performance Must go faster
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants
  翻译: