On Fri, Apr 26, 2024 at 6:30 AM H.J. Lu <hjl.tools@gmail.com> wrote:
On Thu, Apr 25, 2024 at 9:03 PM abush wang <abushwangs@gmail.com> wrote:
>
> Hi, H.J.
> When I compared glibc performance between 2.28 and 2.38,
> I found a performance regression in strlen.
> In fact, the difference comes down to __strlen_avx2 (2.28) versus __strlen_evex (2.38):
>
> ```
> 2.28
> __strlen_avx2 () at ../sysdeps/x86_64/multiarch/strlen-avx2.S:42
> 42 ENTRY (STRLEN)
>
>
> 2.38
> __strlen_evex () at ../sysdeps/x86_64/multiarch/strlen-evex.S:79
> 79 ENTRY_P2ALIGN (STRLEN, 6)
> ```
>
> This is my test:
> ```
> #include <stdio.h>
> #include <stdlib.h>
> #include <stdint.h>
> #include <string.h>
>
> #define MAX_STRINGS 100
>
> uint64_t rdtsc() {
>     uint32_t lo, hi;
>     __asm__ __volatile__ (
>         "rdtsc" : "=a"(lo), "=d"(hi)
>     );
>     return ((uint64_t)hi << 32) | lo;
> }
>
> int main(int argc, char *argv[]) {
>     char *input_str[MAX_STRINGS];
>     size_t lengths[MAX_STRINGS];
>     int num_strings = 0; // Number of input strings
>     uint64_t start_cycles, end_cycles;
>
>     // Parse command line arguments and store pointers in input_str array
>     for (int i = 1; i < argc && num_strings < MAX_STRINGS; ++i) {
>         input_str[num_strings] = argv[i];
>         num_strings++;
>     }
>
>     // Measure the strlen operation for each string
>     start_cycles = rdtsc();
>     for (int i = 0; i < num_strings; ++i) {
>         lengths[i] = strlen(input_str[i]);
>     }
>     end_cycles = rdtsc();
>
>     unsigned long long total_cycle = end_cycles - start_cycles;
>     // Guard against division by zero when no strings are passed
>     unsigned long long av_cycle = num_strings ? total_cycle / num_strings : 0;
>     // Print the total cycles taken for the strlen operations
>     printf("Total cycles: %llu av cycle: %llu \n", total_cycle, av_cycle);
>
>     // Print the recorded lengths
>     printf("Lengths of the input strings:\n");
>     for (int i = 0; i < num_strings; ++i) {
>         printf("String %d length: %zu\n", i, lengths[i]);
>     }
>
>     return 0;
> }
> ```
>
> This is the result:
> ```
> 2.28
> ./strlen_test str1 str2 str3 str4 str5
> Total cycles: 1468 av cycle: 293
> Lengths of the input strings:
> String 0 length: 4
> String 1 length: 4
> String 2 length: 4
> String 3 length: 4
> String 4 length: 4
>
> 2.38
> ./strlen_test str1 str2 str3 str4 str5
> Total cycles: 1814 av cycle: 362
> Lengths of the input strings:
> String 0 length: 4
> String 1 length: 4
> String 2 length: 4
> String 3 length: 4
> String 4 length: 4
> ```
>
> Thanks,
> abush

I'm not sure how you are measuring the performance of the strlen function.
Are you drawing performance conclusions from just these 2 runs?

2.28
Total cycles: 1468 av cycle: 293

2.38
Total cycles: 1814 av cycle: 362

Please use the glibc microbenchmarks to see if you can reproduce the performance drop.
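
For illustration only, here is a minimal sketch of a more repeatable standalone measurement. It is not the glibc benchtests harness. With only five calls timed once, one-off costs such as first-call symbol resolution and cold caches can easily dominate the result. The sketch warms up first, times many calls per round, and keeps the fastest round. The names `now_ns` and `bench_strlen_ns` are hypothetical, and it uses `clock_gettime(CLOCK_MONOTONIC)` in place of `rdtsc`, so it reports nanoseconds rather than cycles.

```c
#define _POSIX_C_SOURCE 199309L
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <time.h>

/* Monotonic clock reading in nanoseconds (POSIX clock_gettime). */
static uint64_t now_ns(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000000000ull + (uint64_t)ts.tv_nsec;
}

/* Hypothetical helper: returns the best-of-`rounds` average time per
 * strlen call in nanoseconds, or a negative value if rounds or iters
 * is zero. The measured length is stored in *len_out. */
static double bench_strlen_ns(const char *s, unsigned rounds,
                              unsigned iters, size_t *len_out) {
    assert(s != NULL && len_out != NULL);
    *len_out = 0;
    if (rounds == 0 || iters == 0)
        return -1.0;

    /* Calling through a volatile function pointer keeps the compiler
     * from folding or hoisting the strlen call out of the loop. */
    size_t (*volatile fn)(const char *) = strlen;
    size_t sink = 0;
    uint64_t best = UINT64_MAX;

    /* Warm-up: pays first-call resolution and cache-miss costs
     * outside the timed region. */
    for (unsigned i = 0; i < iters; ++i)
        sink += fn(s);

    for (unsigned r = 0; r < rounds; ++r) {
        uint64_t t0 = now_ns();
        for (unsigned i = 0; i < iters; ++i)
            sink += fn(s);
        uint64_t t1 = now_ns();
        if (t1 - t0 < best)
            best = t1 - t0;   /* keep the least-disturbed round */
    }

    /* Every call returned the same length, so the sum divides evenly. */
    *len_out = sink / ((size_t)iters * ((size_t)rounds + 1));
    return (double)best / (double)iters;
}
```

Calling something like `bench_strlen_ns(argv[1], 20, 1000000, &len)` under each glibc version, several times and pinned to one core, should give numbers that are easier to compare than a single timed pass over five strings.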
 

Which processors did you use?  Sunil, Noah, can we reproduce it?

--
H.J.