Performance of calling POSIX-specified functions versus direct Linux kernel calls


In an answer over on Stack Overflow, I provided a code sample to perform some small task referenced in the question. The original question had to do with the fastest-performing technique (so performance criteria are in play, here).



Another commenter/answerer suggested that making a POSIX-defined system API call (in this case, readdir) was not as fast as making a direct system call into the kernel (syscall(SYS_getdents,...)) and the claimed performance difference is in the 25% range. (I didn't implement and re-benchmark; I believe that the performance could in fact be better.)



My question is about the performance characteristics of the proposed syscall-based solution and why it might be faster. I can think of a few reasons why performance might be better:

  1. POSIX readdir is inherently more complicated than syscall(SYS_getdents,...)/getdents()

  2. readdir (which presumably calls syscall(SYS_getdents,...)) simply adds the overhead of indirection

  3. readdir only returns one record (per kernel-call) versus syscall(SYS_getdents,...)/getdents(), which returns (presumably) more than one record per kernel-call


I can't imagine that #1 above is true. readdir and getdents are so similar that the implementation of readdir in glibc simply can't have many more "true" system calls than a direct-invocation of syscall(SYS_getdents,...)/getdents() would invoke.



I can't imagine that #2 is true, either, since calling readdir likely wraps getdents and syscall(SYS_getdents,...) likely calls getdents as well (the proposed answer specifically uses syscall(SYS_getdents,...) instead of calling getdents directly). It's possible that everything within glibc on Linux boils down to syscall(syscallid, args), in which case #2 probably is true.



The last possibility seems to me to be the best explanation: fewer calls into the kernel simply results in faster performance.



Is there any specific explanation for why a "direct kernel call" would be measurably faster than calling a POSIX-defined function?










  • POSIX does not specify any system calls; it specifies APIs.

    – fpmurphy
    Jan 10 '18 at 2:14













  • @fpmurphy1 Pardon me for misspeaking. I meant POSIX-defined APIs not system calls, of course.

    – Christopher Schultz
    Jan 10 '18 at 21:08


















linux performance posix glibc syscalls






edited Jan 10 '18 at 21:10
asked Jan 9 '18 at 22:56
Christopher Schultz








2 Answers
Functions like readdir() and friends are implemented in libc, which is a shared library. As with all shared libraries, that adds some redirection in order to be able to resolve the memory address of the function inside the shared library.



The first time any particular library call is performed, the dynamic linker needs to look up the address of the library call inside a hash table. This involves at least one (but possibly more) string comparisons, a comparatively expensive operation. The found address is then saved in the GOT (global offset table) slot that the function's PLT (procedure linkage table) entry jumps through, so that the next time the function is called, the overhead of finding the function is reduced to a few instructions (on the x86 architecture, fewer than that on some other architectures). This is why compiling something as a shared object (rather than a static object) has some overhead. For more information on shared library overhead and how shared libraries work on Linux, see Ulrich Drepper's detailed technical explanation on the subject.



The syscall() function itself is also implemented in libc, so it too has that redirection. However, since you would only use that function (and no other), the dynamic linker has less work to do. In addition, the implementation of a particular function such as readdir has to convert the return values, do error checking, etc. after the system call returns, which is some extra overhead. A program that runs syscall() directly would work with the direct return values of the system call and would not need that conversion (it would still need to do the error checking, which would complicate the function significantly).
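To make the comparison concrete, here is a minimal sketch of calling the kernel through syscall() next to the ordinary wrapper. It assumes Linux with glibc; the helper names raw_getpid and raw_close_errno are made up for this illustration. One nuance it shows: per the syscall(2) man page, syscall() itself already reports failure as -1 with errno set, so the caller does not see the kernel's raw negative error code.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: ask the kernel for our PID through the generic
 * syscall() entry point instead of the getpid() wrapper. */
pid_t raw_getpid(void)
{
    return (pid_t)syscall(SYS_getpid);
}

/* Hypothetical helper: close a descriptor through syscall().
 * Returns 0 on success, or the errno value on failure. */
int raw_close_errno(int fd)
{
    if (syscall(SYS_close, fd) == -1)
        return errno;   /* syscall() has already set errno for us */
    return 0;
}
```

Both helpers should agree with their wrapper counterparts: raw_getpid() returns the same value as getpid(), and raw_close_errno(-1) reports EBADF just as close(-1) would.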



The downside of running syscall() directly is that you move to an API that is less portable. The syscall() manpage explains some architecture-specific constraints that libc deals with for you; if you use syscall() directly, your function might work on the architecture you're dealing with, but would fail on, say, an ARM machine.



In general, I would recommend against using the syscall() API directly, for the very same reason that I would recommend against writing the code in assembly language directly. Yes, that might end up being faster in the end, but the maintenance burden becomes (much) higher. Some things you could do instead:




  • Don't care about performance. Systems keep getting cheaper, and in many cases "adding another system so things go faster" is cheaper than "paying a programmer's hourly rates to improve performance".

  • Compile the software against static libraries rather than using shared libraries, for the few small things where performance is critical (e.g., gcc -static)

  • Use a profiler to see where things are going slow, and focus on those things, rather than worrying about how to do a system call.






  • Thank you for your detailed response. I agree that direct system calls should be avoided and portability (and, perhaps more importantly, readability) is a more important factor than performance in general. In my question, I was specifically curious about the performance implications of using syscall directly as opposed to a POSIX-defined API which ultimately probably makes the same underlying system call. Could you clarify in your answer whether you think that "translation overhead" or "less frequent kernel calls" is likely to make the greater performance difference, here?

    – Christopher Schultz
    Jan 10 '18 at 21:06











  • He's saying that there is a one-time cost in the glibc version of the call due to the overhead of dynamic linking. There is no reduction in the number of system calls in the usual case of calling a function that always just calls the underlying system call. In some cases, calling the "POSIX" calls (really just library calls) may have a huge reduction: something like getc will only make an occasional kernel call to read many characters and then cache the result in-process, so most calls don't go to the kernel at all.

    – BeeOnRope
    Feb 21 '18 at 19:29











  • Wouter - interesting point about the PLT, but you might point out this is a one-time cost: if the cost of any given system call is important, you are likely calling it many times (or else it wouldn't be important) and the PLT cost goes to zero (in a relative sense). In my benchmarks, using syscall(...) is slower than calling the library wrapper (by about 4 cycles, generally), probably because syscall is a generic varargs method and has to shuffle all the arguments around to make them line up with the syscall interface, while the wrapper functions have less of an issue.

    – BeeOnRope
    Feb 21 '18 at 19:32













  • @BeeOnRope I actually did point that out ;-) but my main point was "don't worry about performance in a generic sense, worry about the output of your profiler". I stand by that, as your point shows, too.

    – Wouter Verhelst
    Feb 26 '18 at 17:11











  • Directory scans on Linux start at about 5µs for an empty directory. The PLT overhead (<1ns) or even the function call overhead (1-2ns) of readdir is a drop in the bucket there. Completely negligible.

    – PSkocik
    35 mins ago



















Factors like PLT indirection or syscall()'s variadic-ness (registers have to be saved to memory) should play little role given that getdents is one of the most expensive calls in Linux.



Fully reading an empty directory on my machine costs about 5µs, a 100-item directory 37µs, a 1000-item directory 340µs and a 10,000-item directory 3.79ms.
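Numbers like these can be reproduced with a small timing harness. The following is only a sketch of one way to measure a full opendir/readdir scan, not the code that produced the figures above; count_entries and scan_ns are names made up for this illustration, and the result will vary with the filesystem and cache state.

```c
#define _POSIX_C_SOURCE 200809L
#include <dirent.h>
#include <stddef.h>
#include <time.h>

/* Count the entries of `path` (including "." and "..") with
 * opendir/readdir. Returns -1 if the directory cannot be opened.
 * For brevity, a readdir error is treated like end-of-directory. */
long count_entries(const char *path)
{
    DIR *dir = opendir(path);
    if (dir == NULL)
        return -1;

    long n = 0;
    while (readdir(dir) != NULL)
        n++;

    closedir(dir);
    return n;
}

/* Average wall-clock nanoseconds per full scan of `path` over
 * `rounds` repetitions. Returns -1.0 on error. */
double scan_ns(const char *path, int rounds)
{
    struct timespec t0, t1;

    if (rounds <= 0)
        return -1.0;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int i = 0; i < rounds; i++)
        if (count_entries(path) < 0)
            return -1.0;
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
    return ns / rounds;
}
```

Calling something like scan_ns("/tmp", 1000) after a warm-up pass gives a per-scan average that can be compared across directory sizes.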



What fdopendir+readdir does on top of getdents is add a buffer allocation/deallocation (0.1µs) and a stat check that the supplied fd refers to a directory (0.4µs).

So there is something like a one-time overhead of 0.5µs, which is 10% of the directory scanning time for empty directories, but only about 1% for 100-item directories and negligible for larger ones. This overhead is five times smaller (only the allocation/deallocation cost) if you don't need fdopendir. (You only need fdopendir if you can't opendir directly and therefore must go through a separately obtained (e.g., openat'ted) file descriptor.)



So if you use a custom one-time allocated buffer along with getdents, you can save 2-10% on the scanning cost of empty directories and negligibly little on the bigger ones.
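A rough sketch of that approach follows, assuming Linux and glibc. It uses SYS_getdents64 rather than the SYS_getdents from the question, because some newer architectures (aarch64, for example) only provide the 64-bit variant. The record layout follows the getdents(2) man page; the struct name kernel_dirent64 and the function count_entries_getdents are made up for this illustration.

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Record layout returned by getdents64, as described in getdents(2).
 * The struct name is our own; libc does not export one. */
struct kernel_dirent64 {
    uint64_t       d_ino;
    int64_t        d_off;
    unsigned short d_reclen;   /* size of this whole record in bytes */
    unsigned char  d_type;
    char           d_name[];
};

/* Count the entries of `path` using getdents64 and a caller-owned
 * buffer, so repeated scans need no allocation.
 * Returns -1 on any error (including a buffer that is too small). */
long count_entries_getdents(const char *path, char *buf, size_t buflen)
{
    int fd = open(path, O_RDONLY | O_DIRECTORY | O_CLOEXEC);
    if (fd < 0)
        return -1;

    long n = 0;
    for (;;) {
        /* One kernel call fills the buffer with as many records as fit. */
        long got = syscall(SYS_getdents64, fd, buf, (unsigned int)buflen);
        if (got < 0) { n = -1; break; }
        if (got == 0) break;               /* end of directory */

        for (long pos = 0; pos < got; ) {
            unsigned short reclen;
            /* memcpy avoids assuming that buf is suitably aligned */
            memcpy(&reclen,
                   buf + pos + offsetof(struct kernel_dirent64, d_reclen),
                   sizeof reclen);
            pos += reclen;
            n++;
        }
    }

    close(fd);
    return n;
}
```

A caller would typically declare one buffer up front, for example char buf[32768], and pass it to every scan. Note that readdir does much the same internally: it also fetches many records per kernel call into its own buffer and hands them out one at a time.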





(The cost of PLT indirection on modern hardware is typically less than 1ns; function call overhead is about 1-2ns. Given that the directory scanning times are on the order of microseconds, you'd need to make at least 1000 calls for these factors to add up to a single µs, but by then the scanning cost is 340µs and the accrued extra µs is about 0.3%, a negligible effect.)






share|improve this answer


























    Your Answer








    StackExchange.ready(function() {
    var channelOptions = {
    tags: "".split(" "),
    id: "106"
    };
    initTagRenderer("".split(" "), "".split(" "), channelOptions);

    StackExchange.using("externalEditor", function() {
    // Have to fire editor after snippets, if snippets enabled
    if (StackExchange.settings.snippets.snippetsEnabled) {
    StackExchange.using("snippets", function() {
    createEditor();
    });
    }
    else {
    createEditor();
    }
    });

    function createEditor() {
    StackExchange.prepareEditor({
    heartbeatType: 'answer',
    autoActivateHeartbeat: false,
    convertImagesToLinks: false,
    noModals: true,
    showLowRepImageUploadWarning: true,
    reputationToPostImages: null,
    bindNavPrevention: true,
    postfix: "",
    imageUploader: {
    brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
    contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
    allowUrls: true
    },
    onDemand: true,
    discardSelector: ".discard-answer"
    ,immediatelyShowMarkdownHelp:true
    });


    }
    });














    draft saved

    draft discarded


















    StackExchange.ready(
    function () {
    StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2funix.stackexchange.com%2fquestions%2f415944%2fperformance-of-calling-posix-specified-functions-versus-direct-linux-kernel-call%23new-answer', 'question_page');
    }
    );

    Post as a guest















    Required, but never shown

























    2 Answers
    2






    active

    oldest

    votes








    2 Answers
    2






    active

    oldest

    votes









    active

    oldest

    votes






    active

    oldest

    votes









    0














    Functions like readdir() and friends are implemented in libc, which is a shared library. As with all shared libraries, that adds some redirection in order to be able to resolve the memory address of the function inside the shared library.



    The first time any particular library call is performed, the dynamic linker needs to look up the address of the library call inside a hash table. This involves at least one (but possibly more) string comparisons, a comparatively expensive method. The found address is then saved in the PLT (procedure linkage table), so that the next time the function is called, the overhead of finding the function is reduced to three instructions (on the x86 architecture, fewer than that on some other architectures). This is why compiling something as a shared object (rather than a static object) has some overhead. For more information on shared library overhead and how shared libraries work on Linux, see Ulrich Drepper's detailed technical explanation on the subject.



    The syscall() function itself is also implemented in libc, so it too has that redirection. However, since you would only use that function (and no other), the dynamic linker has less work to do. In addition, the implementation of a particular function such as readdir would have to convert the return values and do error checking etc upon exiting the syscall() function, which is some extra overhead. A program that runs syscall() directly would work with the direct return values of the system call and would not need that conversion (it would still need to do the error checking, which would complicate the function significantly).



    The downside of running syscall() directly is that you move to an API that is less portable. The syscall() manpage explains some architecture-specific constraints that libc deals with for you; if you use syscall() directly, your function might work on the architecture you're dealing with, but would fail on, say, an arm machine.



    In general, I would recommend against using the syscall() API directly, for the very same reason that I would recommend against writing the code in assembly language directly. Yes, that might end up being faster in the end, but the maintenance burden becomes (much) higher. Some things you could do instead:




    • Don't care about performance. Systems keep getting cheaper, and in many cases "adding another system so things go faster" is cheaper than "paying a programmer's hourly rates to improve performance".

    • Compile the software against static libraries rather than using shared libraries, for the few small things where performance is critical (i.e., gcc -static)

    • Use a profiler to see where things are going slow, and focus on those things, rather than worrying about how to do a system call.






    share|improve this answer
























    • Thank you for your detailed response. I agree that direct system calls should be avoided and portability (and, perhaps more importantly, readability) is a more important factor than performance in general. In my question, I was specifically curious about the performance implications of using syscall directly as opposed to a POSIX-defined API which ultimately probably makes the same underlying system call. Could you clarify in your answer whether you think that "translation overhead" or "less frequent kernel calls" is likely to make the greater performance difference, here?

      – Christopher Schultz
      Jan 10 '18 at 21:06











    • He's saying that there is a one-time cost in the glibc version of the call due to the overhead of dynamic linking. There is no reduction in the number of system calls in the usual case of calling a function that always just calls the underlying system call. In some cases, calling the "POSIX" calls (really just library calls) may have a huge reduction: something like getc will only make an occasional kernel call to read many characters and then cache the result in-process, so most calls don't go the kernel at all.

      – BeeOnRope
      Feb 21 '18 at 19:29











    • Wouter - interesting point about the PLT, but you might point out this is a one-time cost: if the cost of any given system call is important, you are likely calling it many times (or else it wouldn't be important) and the PLT cost goes to zero (in a relative sense). In my benchmarks, using syscall(...) is slower than calling the library wrapper (by about 4 cycles, generally), probably because syscall is a generic varargs method and has to shuffle all the arguments around to make them line up with the syscall interface, while the wrapper functions have less of an issue.

      – BeeOnRope
      Feb 21 '18 at 19:32













    • @BeeOnRope I actually did point that out ;-) but my main point was "don't worry about performance in a generic sense, worry about the output of your profiler". I stand by that, as your point shows, too.

      – Wouter Verhelst
      Feb 26 '18 at 17:11











    • Directory scans on Linux take from start at about 5µs for an empty directory. The PLT overhead (<1ns) or even the function call overhead (1-2ns) of readdir there is a drop in the bucket. Completely negligible.

      – PSkocik
      35 mins ago
















    0














    Functions like readdir() and friends are implemented in libc, which is a shared library. As with all shared libraries, that adds some redirection in order to be able to resolve the memory address of the function inside the shared library.



    The first time any particular library call is performed, the dynamic linker needs to look up the address of the library call inside a hash table. This involves at least one (but possibly more) string comparisons, a comparatively expensive method. The found address is then saved in the PLT (procedure linkage table), so that the next time the function is called, the overhead of finding the function is reduced to three instructions (on the x86 architecture, fewer than that on some other architectures). This is why compiling something as a shared object (rather than a static object) has some overhead. For more information on shared library overhead and how shared libraries work on Linux, see Ulrich Drepper's detailed technical explanation on the subject.



    The syscall() function itself is also implemented in libc, so it too has that redirection. However, since you would only use that function (and no other), the dynamic linker has less work to do. In addition, the implementation of a particular function such as readdir would have to convert the return values and do error checking etc upon exiting the syscall() function, which is some extra overhead. A program that runs syscall() directly would work with the direct return values of the system call and would not need that conversion (it would still need to do the error checking, which would complicate the function significantly).



    The downside of running syscall() directly is that you move to an API that is less portable. The syscall() manpage explains some architecture-specific constraints that libc deals with for you; if you use syscall() directly, your function might work on the architecture you're dealing with, but would fail on, say, an arm machine.



    In general, I would recommend against using the syscall() API directly, for the very same reason that I would recommend against writing the code in assembly language directly. Yes, that might end up being faster in the end, but the maintenance burden becomes (much) higher. Some things you could do instead:




    • Don't care about performance. Systems keep getting cheaper, and in many cases "adding another system so things go faster" is cheaper than "paying a programmer's hourly rates to improve performance".

    • Compile the software against static libraries rather than using shared libraries, for the few small things where performance is critical (i.e., gcc -static)

    • Use a profiler to see where things are going slow, and focus on those things, rather than worrying about how to do a system call.






    share|improve this answer
























    • Thank you for your detailed response. I agree that direct system calls should be avoided and portability (and, perhaps more importantly, readability) is a more important factor than performance in general. In my question, I was specifically curious about the performance implications of using syscall directly as opposed to a POSIX-defined API which ultimately probably makes the same underlying system call. Could you clarify in your answer whether you think that "translation overhead" or "less frequent kernel calls" is likely to make the greater performance difference, here?

      – Christopher Schultz
      Jan 10 '18 at 21:06











    • He's saying that there is a one-time cost in the glibc version of the call due to the overhead of dynamic linking. There is no reduction in the number of system calls in the usual case of calling a function that always just calls the underlying system call. In some cases, calling the "POSIX" calls (really just library calls) may have a huge reduction: something like getc will only make an occasional kernel call to read many characters and then cache the result in-process, so most calls don't go the kernel at all.

      – BeeOnRope
      Feb 21 '18 at 19:29











    • Wouter - interesting point about the PLT, but you might point out this is a one-time cost: if the cost of any given system call is important, you are likely calling it many times (or else it wouldn't be important) and the PLT cost goes to zero (in a relative sense). In my benchmarks, using syscall(...) is slower than calling the library wrapper (by about 4 cycles, generally), probably because syscall is a generic varargs method and has to shuffle all the arguments around to make them line up with the syscall interface, while the wrapper functions have less of an issue.

      – BeeOnRope
      Feb 21 '18 at 19:32













    • @BeeOnRope I actually did point that out ;-) but my main point was "don't worry about performance in a generic sense, worry about the output of your profiler". I stand by that, as your point shows, too.

      – Wouter Verhelst
      Feb 26 '18 at 17:11











    • Directory scans on Linux start at about 5µs for an empty directory. The PLT overhead (<1ns) or even the function call overhead (1-2ns) of readdir is a drop in the bucket there. Completely negligible.

      – PSkocik
      35 mins ago





















    Functions like readdir() and friends are implemented in libc, which is a shared library. As with all shared libraries, that adds some redirection in order to be able to resolve the memory address of the function inside the shared library.



    The first time any particular library call is performed, the dynamic linker needs to look up the address of the library call inside a hash table. This involves one or more string comparisons, a comparatively expensive operation. The address it finds is then cached for the function's PLT (procedure linkage table) entry, so that the next time the function is called, the overhead of finding the function is reduced to three instructions (on the x86 architecture, fewer than that on some other architectures). This is why compiling something as a shared object (rather than a static object) has some overhead. For more information on shared library overhead and how shared libraries work on Linux, see Ulrich Drepper's detailed technical explanation on the subject.



    The syscall() function itself is also implemented in libc, so it too has that redirection. However, since you would only use that function (and no other), the dynamic linker has less work to do. In addition, the implementation of a particular function such as readdir would have to convert the return values and do error checking etc upon exiting the syscall() function, which is some extra overhead. A program that runs syscall() directly would work with the direct return values of the system call and would not need that conversion (it would still need to do the error checking, which would complicate the function significantly).
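As a rough illustration of the two call paths, here is a minimal sketch in C. The helper names (pid_via_wrapper, pid_via_syscall, raw_close_errno) are made up for this example. It assumes Linux with glibc, where syscall() reports failure the same way the wrappers do: it returns -1 and sets errno.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: PID obtained through the ordinary libc wrapper. */
pid_t pid_via_wrapper(void)
{
    return getpid();
}

/* Hypothetical helper: the same PID through the generic syscall() entry point. */
pid_t pid_via_syscall(void)
{
    return (pid_t)syscall(SYS_getpid);
}

/* Hypothetical helper: closes fd through syscall() and returns the resulting
   errno value, or 0 if the call succeeded. */
int raw_close_errno(int fd)
{
    errno = 0;
    long rc = syscall(SYS_close, fd);
    return rc == -1 ? errno : 0;
}
```

Calling raw_close_errno(-1) should report EBADF, just as close(-1) would, so the caller still has to do the same error checking either way.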



    The downside of running syscall() directly is that you move to an API that is less portable. The syscall() manpage explains some architecture-specific constraints that libc deals with for you; if you use syscall() directly, your function might work on the architecture you're dealing with, but would fail on, say, an ARM machine.



    In general, I would recommend against using the syscall() API directly, for the very same reason that I would recommend against writing the code in assembly language directly. Yes, that might end up being faster in the end, but the maintenance burden becomes (much) higher. Some things you could do instead:




    • Don't care about performance. Systems keep getting cheaper, and in many cases "adding another system so things go faster" is cheaper than "paying a programmer's hourly rates to improve performance".

    • Compile the software against static libraries rather than using shared libraries, for the few small things where performance is critical (e.g., with gcc -static)

    • Use a profiler to see where things are going slow, and focus on those things, rather than worrying about how to do a system call.
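For a quick sanity check before reaching for a full profiler, the two call paths can be timed directly. The following is only a sketch: the function names are hypothetical, getppid is chosen simply because it is a cheap system call, and the numbers will vary with hardware, kernel and libc version.

```c
#define _GNU_SOURCE
#include <sys/syscall.h>
#include <time.h>
#include <unistd.h>

/* Monotonic clock reading in nanoseconds. */
double now_ns(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (double)ts.tv_sec * 1e9 + (double)ts.tv_nsec;
}

/* Average cost in ns of one getppid() call through the libc wrapper. */
double ns_per_call_wrapper(long iters)
{
    volatile long sink = 0;   /* keeps the compiler from dropping the calls */
    double start = now_ns();
    for (long i = 0; i < iters; i++)
        sink += getppid();
    return (now_ns() - start) / (double)iters;
}

/* Average cost in ns of the same system call through syscall(). */
double ns_per_call_syscall(long iters)
{
    volatile long sink = 0;
    double start = now_ns();
    for (long i = 0; i < iters; i++)
        sink += syscall(SYS_getppid);
    return (now_ns() - start) / (double)iters;
}
```

Run each function with a large iteration count (a million or so) and compare the two averages. The measurement includes loop overhead, so treat it as a rough comparison rather than a precise cycle count.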
















    answered Jan 10 '18 at 11:35









    Wouter Verhelst









































    Factors like PLT indirection or syscall()'s variadic-ness (registers have to be saved to memory) should play little role given that getdents is one of the most expensive calls in Linux.



    Fully reading an empty directory on my machine costs about 5µs, a 100-item directory 37µs, a 1000-item directory 340µs and a 10,000-item directory 3.79ms.



    What fdopendir+readdir adds on top of getdents is a buffer allocation/deallocation (0.1µs) and a stat check that the supplied fd refers to a directory (0.4µs).



    So there is something like a one-time overhead of 0.5µs, which is 10% of the directory scanning time for empty directories, but only about 1% for 100-item directories and negligible for larger ones. This overhead is 5 times smaller (only the allocation/deallocation cost) if you don't need fdopendir. (You only need fdopendir if you can't call opendir directly and therefore must go through a separately obtained (e.g., openat'ed) file descriptor.)
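For reference, the descriptor-based route looks roughly like this in C (POSIX spells the calls fdopendir and opendir). This is a sketch only, and count_entries_at is a made-up helper name, not something from the measurements above.

```c
#define _GNU_SOURCE
#include <dirent.h>
#include <fcntl.h>
#include <unistd.h>

/* Hypothetical helper: counts the entries (including "." and "..") of the
   directory `name`, resolved relative to the directory descriptor `dirfd`.
   Returns -1 on error. */
long count_entries_at(int dirfd, const char *name)
{
    int fd = openat(dirfd, name, O_RDONLY | O_DIRECTORY | O_CLOEXEC);
    if (fd < 0)
        return -1;

    DIR *dir = fdopendir(fd);      /* on success the stream owns fd */
    if (dir == NULL) {
        close(fd);
        return -1;
    }

    long count = 0;
    while (readdir(dir) != NULL)
        count++;

    closedir(dir);                 /* also closes the underlying fd */
    return count;
}
```

Passing AT_FDCWD as dirfd makes openat behave like a plain open relative to the current directory; in that case opendir(name) would do the same job without the extra step.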



    So if you use a custom one-time allocated buffer along with getdents, you can save 2-10% on the scanning cost of empty directories and negligibly little on the bigger ones.
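A minimal sketch of that approach, assuming Linux and the getdents64 system call. The record layout is the one documented in getdents(2). The function names are invented for this example, and the caller is assumed to pass a buffer that is suitably aligned (memory from malloc is) and reused across calls.

```c
#define _GNU_SOURCE
#include <dirent.h>
#include <fcntl.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Record layout filled in by getdents64, per getdents(2). */
struct linux_dirent64 {
    uint64_t       d_ino;
    int64_t        d_off;
    unsigned short d_reclen;
    unsigned char  d_type;
    char           d_name[];
};

/* Counts entries (including "." and "..") with the raw system call and a
   caller-owned buffer, so no per-directory allocation happens here.
   Returns -1 on error. */
long count_entries_getdents(const char *path, char *buf, size_t buflen)
{
    int fd = open(path, O_RDONLY | O_DIRECTORY | O_CLOEXEC);
    if (fd < 0)
        return -1;

    long count = 0;
    for (;;) {
        long n = syscall(SYS_getdents64, fd, buf, buflen);
        if (n < 0) {
            close(fd);
            return -1;
        }
        if (n == 0)                 /* end of directory */
            break;
        for (long off = 0; off < n; ) {
            struct linux_dirent64 *d = (struct linux_dirent64 *)(buf + off);
            off += d->d_reclen;     /* records are variable-length */
            count++;
        }
    }
    close(fd);
    return count;
}

/* The same count through the portable opendir/readdir interface. */
long count_entries_readdir(const char *path)
{
    DIR *dir = opendir(path);
    if (dir == NULL)
        return -1;
    long count = 0;
    while (readdir(dir) != NULL)
        count++;
    closedir(dir);
    return count;
}
```

A caller would allocate one buffer up front (32 KiB, say) and pass it to count_entries_getdents for every directory it scans. The readdir version is included only so the two results can be compared.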





    (The cost of PLT indirection on modern hardware is typically less than 1ns, and function call overhead is about 1-2ns. Given that directory scanning times are on the order of microseconds, you'd need to make at least 1000 calls for these factors to add up to a single µs, but by then the scanning cost is 340µs and the accrued extra µs is about 0.3%: a negligible effect.)
















        answered 42 mins ago









        PSkocik






























