Performance of calling POSIX-specified functions versus direct Linux kernel calls
In an answer over on Stack Overflow, I provided a code sample to perform some small task referenced in the question. The original question had to do with the fastest-performing technique (so performance criteria are in play, here).
Another commenter/answerer suggested that making a POSIX-defined system API call (in this case, readdir) was not as fast as making a direct system call into the kernel (syscall(SYS_getdents,...)) and the claimed performance difference is in the 25% range. (I didn't implement and re-benchmark; I believe that the performance could in fact be better.)
My question is about the performance characteristics of the proposed syscall-based solution and why they might be faster. I can think of a few reasons why performance might be better:
1. POSIX readdir is inherently more complicated than syscall(SYS_getdents,...)/getdents()
2. readdir (which presumably calls syscall(SYS_getdents,...)) simply adds the overhead of indirection
3. readdir only returns one record (per kernel-call) versus syscall(SYS_getdents,...)/getdents(), which returns (presumably) more than one record per kernel-call
I can't imagine that #1 above is true. readdir and getdents are so similar that the implementation of readdir in glibc simply can't have many more "true" system calls than a direct-invocation of syscall(SYS_getdents,...)/getdents() would invoke.
I can't imagine that #2 is true, either, since readdir likely wraps getdents and syscall(SYS_getdents,...) likely calls getdents as well (the proposed answer specifically uses syscall(SYS_getdents,...) instead of calling getdents directly). It's possible that everything within glibc on Linux boils down to syscall(syscallid, args), in which case #2 probably is true.
The last possibility seems to me to be the best explanation: fewer calls into the kernel simply result in faster performance.
Is there any specific explanation for why a "direct kernel call" would be measurably faster than calling a POSIX-defined function?
linux performance posix glibc syscalls
2
POSIX does not specify any system calls, it specifies APIs
– fpmurphy
Jan 10 '18 at 2:14
@fpmurphy1 Pardon me for misspeaking. I meant POSIX-defined APIs not system calls, of course.
– Christopher Schultz
Jan 10 '18 at 21:08
asked Jan 9 '18 at 22:56, edited Jan 10 '18 at 21:10
– Christopher Schultz
2 Answers
Functions like readdir() and friends are implemented in libc, which is a shared library. As with all shared libraries, that adds some redirection in order to be able to resolve the memory address of the function inside the shared library.
The first time any particular library call is performed, the dynamic linker needs to look up the address of the library call inside a hash table. This involves at least one (but possibly more) string comparisons, a comparatively expensive operation. The found address is then saved in the GOT slot used by the function's PLT (procedure linkage table) entry, so that the next time the function is called, the overhead of finding the function is reduced to three instructions (on the x86 architecture, fewer than that on some other architectures). This is why compiling something as a shared object (rather than a static object) has some overhead. For more information on shared library overhead and how shared libraries work on Linux, see Ulrich Drepper's detailed technical explanation on the subject.
The syscall() function itself is also implemented in libc, so it too has that redirection. However, since you would only use that function (and no other), the dynamic linker has less work to do. In addition, the implementation of a particular function such as readdir would have to convert the return values and do error checking etc. upon exiting the syscall() function, which is some extra overhead. A program that runs syscall() directly would work with the direct return values of the system call and would not need that conversion (it would still need to do the error checking, which would complicate the function significantly).
The downside of running syscall() directly is that you move to an API that is less portable. The syscall() manpage explains some architecture-specific constraints that libc deals with for you; if you use syscall() directly, your function might work on the architecture you're dealing with, but would fail on, say, an ARM machine.
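To make the two call paths concrete, here is a minimal sketch that issues the same kernel request once through a libc wrapper and once through the generic syscall() entry point. The function names pid_paths_agree and errno_from_bad_close are made up for illustration. It assumes Linux with glibc, where syscall() reports failure by returning -1 and setting errno.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: returns 1 if the libc wrapper getpid() and the
   generic syscall() entry point report the same process ID. */
int pid_paths_agree(void)
{
    pid_t via_wrapper = getpid();
    long via_syscall = syscall(SYS_getpid);
    return (long)via_wrapper == via_syscall;
}

/* Hypothetical helper: closes an invalid descriptor through syscall()
   and returns the errno value that libc stored (0 if the call did not
   fail as expected). */
int errno_from_bad_close(void)
{
    errno = 0;
    long rc = syscall(SYS_close, -1);
    return rc == -1 ? errno : 0;
}
```

Both helpers are ordinary calls into libc, so neither skips the shared-library indirection described above. What changes is that the caller has to pick the right syscall number and argument layout for the architecture.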
In general, I would recommend against using the syscall() API directly, for the very same reason that I would recommend against writing the code in assembly language directly. Yes, that might end up being faster in the end, but the maintenance burden becomes (much) higher. Some things you could do instead:
- Don't care about performance. Systems keep getting cheaper, and in many cases "adding another system so things go faster" is cheaper than "paying a programmer's hourly rates to improve performance".
- Compile the software against static libraries rather than using shared libraries, for the few small things where performance is critical (i.e., gcc -static)
- Use a profiler to see where things are going slow, and focus on those things, rather than worrying about how to do a system call.
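As a starting point for the "measure first" advice, a rough timing harness along these lines could be used before reaching for a real profiler. This is only a sketch: count_entries_readdir and avg_scan_ns are hypothetical names, and repeated scans of the same directory mostly measure the warm-cache case.

```c
#define _POSIX_C_SOURCE 200809L
#include <dirent.h>
#include <stddef.h>
#include <time.h>

/* Counts the entries of `path` with opendir()/readdir().
   Returns -1 if the directory cannot be opened. */
long count_entries_readdir(const char *path)
{
    DIR *dir = opendir(path);
    if (dir == NULL)
        return -1;
    long n = 0;
    while (readdir(dir) != NULL)
        n++;
    closedir(dir);
    return n;
}

/* Average wall-clock nanoseconds per full scan of `path` over `rounds`
   runs, or -1.0 if rounds is not positive or a scan fails. */
double avg_scan_ns(const char *path, int rounds)
{
    struct timespec t0, t1;
    if (rounds <= 0)
        return -1.0;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int i = 0; i < rounds; i++) {
        if (count_entries_readdir(path) < 0)
            return -1.0;
    }
    clock_gettime(CLOCK_MONOTONIC, &t1);
    double total_ns = (double)(t1.tv_sec - t0.tv_sec) * 1e9
                    + (double)(t1.tv_nsec - t0.tv_nsec);
    return total_ns / rounds;
}
```

Calling avg_scan_ns on a few directories of different sizes gives a baseline to compare any alternative implementation against. A profiler would still be needed to see where the time goes inside a scan.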
Thank you for your detailed response. I agree that direct system calls should be avoided and portability (and, perhaps more importantly, readability) is a more important factor than performance in general. In my question, I was specifically curious about the performance implications of using syscall directly as opposed to a POSIX-defined API which ultimately probably makes the same underlying system call. Could you clarify in your answer whether you think that "translation overhead" or "less frequent kernel calls" is likely to make the greater performance difference, here?
– Christopher Schultz
Jan 10 '18 at 21:06
He's saying that there is a one-time cost in the glibc version of the call due to the overhead of dynamic linking. There is no reduction in the number of system calls in the usual case of calling a function that always just calls the underlying system call. In some cases, calling the "POSIX" calls (really just library calls) may have a huge reduction: something like getc will only make an occasional kernel call to read many characters and then cache the result in-process, so most calls don't go to the kernel at all.
– BeeOnRope
Feb 21 '18 at 19:29
Wouter - interesting point about the PLT, but you might point out this is a one-time cost: if the cost of any given system call is important, you are likely calling it many times (or else it wouldn't be important) and the PLT cost goes to zero (in a relative sense). In my benchmarks, using syscall(...) is slower than calling the library wrapper (by about 4 cycles, generally), probably because syscall is a generic varargs method and has to shuffle all the arguments around to make them line up with the syscall interface, while the wrapper functions have less of an issue.
– BeeOnRope
Feb 21 '18 at 19:32
@BeeOnRope I actually did point that out ;-) but my main point was "don't worry about performance in a generic sense, worry about the output of your profiler". I stand by that, as your point shows, too.
– Wouter Verhelst
Feb 26 '18 at 17:11
Directory scans on Linux start at about 5µs for an empty directory. The PLT overhead (<1ns) or even the function call overhead (1-2ns) of readdir is a drop in the bucket there. Completely negligible.
– PSkocik
35 mins ago
Factors like PLT indirection or syscall()'s variadic-ness (registers have to be saved to memory) should play little role given that getdents is one of the most expensive calls in Linux.
Fully reading an empty directory on my machine costs about 5µs, a 100-item directory 37µs, a 1000-item directory 340µs and a 10,000-item directory 3.79ms.
What fdopendir+readdir does on top of getdents is add a buffer allocation/deallocation (0.1µs) and a stat check that the supplied fd refers to a directory (0.4µs).
So there is something like a one-time overhead of 0.5µs, which is 10% of the directory scanning time for empty directories, but only about 1% for 100-item directories and negligible for larger ones. This overhead is 5 times smaller (only the allocation/deallocation cost) if you don't need fdopendir. (You only need fdopendir if you can't opendir directly and therefore must go through a separately obtained (e.g., openat'ted) file descriptor.)
So if you use a custom one-time allocated buffer along with getdents, you can save 2-10% on the scanning cost of empty directories and negligibly little on the bigger ones.
(The cost of PLT indirection on modern hardware is typically less than 1ns, and function call overhead is about 1-2ns. Given that the directory scanning times are on the order of microseconds, you'd need to make at least 1000 calls for these factors to add up to a single µs, but then the scanning cost is 340µs and the accrued extra µs is like 0.3% -- a negligible effect.)
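For reference, the "custom one-time allocated buffer along with getdents" variant could look roughly like the sketch below, next to a plain readdir counter for comparison. The names raw_dirent64, count_with_getdents64 and count_with_readdir are hypothetical. The sketch uses SYS_getdents64 rather than the SYS_getdents from the question, on the assumption that the legacy number is not defined on every architecture. The record layout is the one documented in the getdents64 man page.

```c
#define _GNU_SOURCE
#include <dirent.h>
#include <fcntl.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Record layout as documented in getdents64(2), declared locally under
   a hypothetical name; only the offset of d_reclen is used below. */
struct raw_dirent64 {
    uint64_t       d_ino;
    int64_t        d_off;
    unsigned short d_reclen;
    unsigned char  d_type;
    char           d_name[];
};

/* Counts the entries of `path` using getdents64 and a caller-supplied,
   reusable buffer. Stores the number of getdents64 calls in *calls.
   Returns -1 on error. */
long count_with_getdents64(const char *path, char *buf, size_t buflen,
                           long *calls)
{
    int fd = open(path, O_RDONLY | O_DIRECTORY | O_CLOEXEC);
    if (fd < 0)
        return -1;
    long n = 0;
    *calls = 0;
    for (;;) {
        long got = syscall(SYS_getdents64, fd, buf, buflen);
        (*calls)++;
        if (got < 0) {
            close(fd);
            return -1;
        }
        if (got == 0)           /* end of directory */
            break;
        /* One kernel call can return many variable-length records. */
        for (long off = 0; off < got; ) {
            unsigned short reclen;
            memcpy(&reclen,
                   buf + off + offsetof(struct raw_dirent64, d_reclen),
                   sizeof reclen);
            n++;
            off += reclen;
        }
    }
    close(fd);
    return n;
}

/* The same count through the portable opendir()/readdir() interface. */
long count_with_readdir(const char *path)
{
    DIR *dir = opendir(path);
    if (dir == NULL)
        return -1;
    long n = 0;
    while (readdir(dir) != NULL)
        n++;
    closedir(dir);
    return n;
}
```

With a 32 KiB buffer, the calls counter shows how few kernel entries a scan actually needs. glibc's readdir also hands out entries from an internal buffer filled by a getdents call, so the two variants should differ mainly in the per-directory setup cost discussed above, not in the number of system calls.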
add a comment |
Your Answer
StackExchange.ready(function() {
var channelOptions = {
tags: "".split(" "),
id: "106"
};
initTagRenderer("".split(" "), "".split(" "), channelOptions);
StackExchange.using("externalEditor", function() {
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled) {
StackExchange.using("snippets", function() {
createEditor();
});
}
else {
createEditor();
}
});
function createEditor() {
StackExchange.prepareEditor({
heartbeatType: 'answer',
autoActivateHeartbeat: false,
convertImagesToLinks: false,
noModals: true,
showLowRepImageUploadWarning: true,
reputationToPostImages: null,
bindNavPrevention: true,
postfix: "",
imageUploader: {
brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
allowUrls: true
},
onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
});
}
});
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2funix.stackexchange.com%2fquestions%2f415944%2fperformance-of-calling-posix-specified-functions-versus-direct-linux-kernel-call%23new-answer', 'question_page');
}
);
Post as a guest
Required, but never shown
2 Answers
2
active
oldest
votes
2 Answers
2
active
oldest
votes
active
oldest
votes
active
oldest
votes
Functions like readdir() and friends are implemented in libc, which is a shared library. As with all shared libraries, that adds some redirection in order to be able to resolve the memory address of the function inside the shared library.
The first time any particular library call is performed, the dynamic linker needs to look up the address of the library call inside a hash table. This involves at least one (but possibly more) string comparisons, a comparatively expensive method. The found address is then saved in the PLT (procedure linkage table), so that the next time the function is called, the overhead of finding the function is reduced to three instructions (on the x86 architecture, fewer than that on some other architectures). This is why compiling something as a shared object (rather than a static object) has some overhead. For more information on shared library overhead and how shared libraries work on Linux, see Ulrich Drepper's detailed technical explanation on the subject.
The syscall() function itself is also implemented in libc, so it too has that redirection. However, since you would only use that function (and no other), the dynamic linker has less work to do. In addition, the implementation of a particular function such as readdir would have to convert the return values and do error checking etc upon exiting the syscall() function, which is some extra overhead. A program that runs syscall() directly would work with the direct return values of the system call and would not need that conversion (it would still need to do the error checking, which would complicate the function significantly).
The downside of running syscall() directly is that you move to an API that is less portable. The syscall() manpage explains some architecture-specific constraints that libc deals with for you; if you use syscall() directly, your function might work on the architecture you're dealing with, but would fail on, say, an arm machine.
In general, I would recommend against using the syscall() API directly, for the very same reason that I would recommend against writing the code in assembly language directly. Yes, that might end up being faster in the end, but the maintenance burden becomes (much) higher. Some things you could do instead:
- Don't care about performance. Systems keep getting cheaper, and in many cases "adding another system so things go faster" is cheaper than "paying a programmer's hourly rates to improve performance".
- Compile the software against static libraries rather than using shared libraries, for the few small things where performance is critical (i.e.,
gcc -static) - Use a profiler to see where things are going slow, and focus on those things, rather than worrying about how to do a system call.
Thank you for your detailed response. I agree that direct system calls should be avoided and portability (and, perhaps more importantly, readability) is a more important factor than performance in general. In my question, I was specifically curious about the performance implications of usingsyscalldirectly as opposed to a POSIX-defined API which ultimately probably makes the same underlying system call. Could you clarify in your answer whether you think that "translation overhead" or "less frequent kernel calls" is likely to make the greater performance difference, here?
– Christopher Schultz
Jan 10 '18 at 21:06
He's saying that there is a one-time cost in theglibcversion of the call due to the overhead of dynamic linking. There is no reduction in the number of system calls in the usual case of calling a function that always just calls the underlying system call. In some cases, calling the "POSIX" calls (really just library calls) may have a huge reduction: something likegetcwill only make an occasional kernel call to read many characters and then cache the result in-process, so most calls don't go the kernel at all.
– BeeOnRope
Feb 21 '18 at 19:29
Wouter - interesting point about the PLT, but you might point out this is a one-time cost: if the cost of any given system call is important, you are likely calling it many times (or else it wouldn't be important) and the PLT cost goes to zero (in a relative sense). In my benchmarks, usingsyscall(...)is slower than calling the library wrapper (by about 4 cycles, generally), probably becausesyscallis a generic varargs method and has to shuffle all the arguments around to make them line up with the syscall interface, while the wrapper functions have less of an issue.
– BeeOnRope
Feb 21 '18 at 19:32
@BeeOnRope I actually did point that out ;-) but my main point was "don't worry about performance in a generic sense, worry about the output of your profiler". I stand by that, as your point shows, too.
– Wouter Verhelst
Feb 26 '18 at 17:11
Directory scans on Linux take from start at about 5µs for an empty directory. The PLT overhead (<1ns) or even the function call overhead (1-2ns) of readdir there is a drop in the bucket. Completely negligible.
– PSkocik
35 mins ago
add a comment |
Functions like readdir() and friends are implemented in libc, which is a shared library. As with all shared libraries, that adds some redirection in order to be able to resolve the memory address of the function inside the shared library.
The first time any particular library call is performed, the dynamic linker needs to look up the address of the library call inside a hash table. This involves at least one (but possibly more) string comparisons, a comparatively expensive method. The found address is then saved in the PLT (procedure linkage table), so that the next time the function is called, the overhead of finding the function is reduced to three instructions (on the x86 architecture, fewer than that on some other architectures). This is why compiling something as a shared object (rather than a static object) has some overhead. For more information on shared library overhead and how shared libraries work on Linux, see Ulrich Drepper's detailed technical explanation on the subject.
The syscall() function itself is also implemented in libc, so it too has that redirection. However, since you would only use that function (and no other), the dynamic linker has less work to do. In addition, the implementation of a particular function such as readdir would have to convert the return values and do error checking etc upon exiting the syscall() function, which is some extra overhead. A program that runs syscall() directly would work with the direct return values of the system call and would not need that conversion (it would still need to do the error checking, which would complicate the function significantly).
The downside of running syscall() directly is that you move to an API that is less portable. The syscall() manpage explains some architecture-specific constraints that libc deals with for you; if you use syscall() directly, your function might work on the architecture you're dealing with, but would fail on, say, an arm machine.
In general, I would recommend against using the syscall() API directly, for the very same reason that I would recommend against writing the code in assembly language directly. Yes, that might end up being faster in the end, but the maintenance burden becomes (much) higher. Some things you could do instead:
- Don't care about performance. Systems keep getting cheaper, and in many cases "adding another system so things go faster" is cheaper than "paying a programmer's hourly rates to improve performance".
- Compile the software against static libraries rather than using shared libraries, for the few small things where performance is critical (i.e.,
gcc -static) - Use a profiler to see where things are going slow, and focus on those things, rather than worrying about how to do a system call.
Thank you for your detailed response. I agree that direct system calls should be avoided and portability (and, perhaps more importantly, readability) is a more important factor than performance in general. In my question, I was specifically curious about the performance implications of usingsyscalldirectly as opposed to a POSIX-defined API which ultimately probably makes the same underlying system call. Could you clarify in your answer whether you think that "translation overhead" or "less frequent kernel calls" is likely to make the greater performance difference, here?
– Christopher Schultz
Jan 10 '18 at 21:06
He's saying that there is a one-time cost in theglibcversion of the call due to the overhead of dynamic linking. There is no reduction in the number of system calls in the usual case of calling a function that always just calls the underlying system call. In some cases, calling the "POSIX" calls (really just library calls) may have a huge reduction: something likegetcwill only make an occasional kernel call to read many characters and then cache the result in-process, so most calls don't go the kernel at all.
– BeeOnRope
Feb 21 '18 at 19:29
Wouter - interesting point about the PLT, but you might point out this is a one-time cost: if the cost of any given system call is important, you are likely calling it many times (or else it wouldn't be important) and the PLT cost goes to zero (in a relative sense). In my benchmarks, usingsyscall(...)is slower than calling the library wrapper (by about 4 cycles, generally), probably becausesyscallis a generic varargs method and has to shuffle all the arguments around to make them line up with the syscall interface, while the wrapper functions have less of an issue.
– BeeOnRope
Feb 21 '18 at 19:32
@BeeOnRope I actually did point that out ;-) but my main point was "don't worry about performance in a generic sense, worry about the output of your profiler". I stand by that, as your point shows, too.
– Wouter Verhelst
Feb 26 '18 at 17:11
Directory scans on Linux take from start at about 5µs for an empty directory. The PLT overhead (<1ns) or even the function call overhead (1-2ns) of readdir there is a drop in the bucket. Completely negligible.
– PSkocik
35 mins ago
add a comment |
Functions like readdir() and friends are implemented in libc, which is a shared library. As with all shared libraries, that adds some redirection in order to be able to resolve the memory address of the function inside the shared library.
The first time any particular library call is performed, the dynamic linker needs to look up the address of the library call inside a hash table. This involves at least one (but possibly more) string comparisons, a comparatively expensive method. The found address is then saved in the PLT (procedure linkage table), so that the next time the function is called, the overhead of finding the function is reduced to three instructions (on the x86 architecture, fewer than that on some other architectures). This is why compiling something as a shared object (rather than a static object) has some overhead. For more information on shared library overhead and how shared libraries work on Linux, see Ulrich Drepper's detailed technical explanation on the subject.
The syscall() function itself is also implemented in libc, so it too has that redirection. However, since you would only use that function (and no other), the dynamic linker has less work to do. In addition, the implementation of a particular function such as readdir would have to convert the return values and do error checking etc upon exiting the syscall() function, which is some extra overhead. A program that runs syscall() directly would work with the direct return values of the system call and would not need that conversion (it would still need to do the error checking, which would complicate the function significantly).
The downside of running syscall() directly is that you move to an API that is less portable. The syscall() manpage explains some architecture-specific constraints that libc deals with for you; if you use syscall() directly, your function might work on the architecture you're dealing with, but would fail on, say, an arm machine.
In general, I would recommend against using the syscall() API directly, for the very same reason that I would recommend against writing the code in assembly language directly. Yes, that might end up being faster in the end, but the maintenance burden becomes (much) higher. Some things you could do instead:
- Don't care about performance. Systems keep getting cheaper, and in many cases "adding another system so things go faster" is cheaper than "paying a programmer's hourly rates to improve performance".
- Compile the software against static libraries rather than using shared libraries, for the few small things where performance is critical (i.e., gcc -static).
- Use a profiler to see where things are going slow, and focus on those things, rather than worrying about how to do a system call.
answered Jan 10 '18 at 11:35
Wouter Verhelst
Thank you for your detailed response. I agree that direct system calls should be avoided and portability (and, perhaps more importantly, readability) is a more important factor than performance in general. In my question, I was specifically curious about the performance implications of using syscall directly as opposed to a POSIX-defined API which ultimately probably makes the same underlying system call. Could you clarify in your answer whether you think that "translation overhead" or "less frequent kernel calls" is likely to make the greater performance difference, here?
– Christopher Schultz
Jan 10 '18 at 21:06
He's saying that there is a one-time cost in the glibc version of the call due to the overhead of dynamic linking. There is no reduction in the number of system calls in the usual case of calling a function that always just calls the underlying system call. In some cases, calling the "POSIX" calls (really just library calls) may give a huge reduction: something like getc will only make an occasional kernel call to read many characters and then cache the result in-process, so most calls don't go to the kernel at all.
– BeeOnRope
Feb 21 '18 at 19:29
Wouter - interesting point about the PLT, but you might point out this is a one-time cost: if the cost of any given system call is important, you are likely calling it many times (or else it wouldn't be important) and the PLT cost goes to zero (in a relative sense). In my benchmarks, using syscall(...) is slower than calling the library wrapper (by about 4 cycles, generally), probably because syscall is a generic varargs method and has to shuffle all the arguments around to make them line up with the syscall interface, while the wrapper functions have less of an issue.
– BeeOnRope
Feb 21 '18 at 19:32
@BeeOnRope I actually did point that out ;-) but my main point was "don't worry about performance in a generic sense, worry about the output of your profiler". I stand by that, as your point shows, too.
– Wouter Verhelst
Feb 26 '18 at 17:11
Directory scans on Linux start at about 5µs for an empty directory. The PLT overhead (<1ns) or even the function call overhead (1-2ns) of readdir is a drop in the bucket there. Completely negligible.
– PSkocik
35 mins ago
Factors like PLT indirection or syscall()'s variadic-ness (registers have to be saved to memory) should play little role given that getdents is one of the most expensive calls in Linux.
Fully reading an empty directory on my machine costs about 5µs, a 100-item directory 37µs, a 1000-item directory 340µs and a 10,000-item directory 3.79ms.
What fdopendir+readdir does on top of getdents is add a buffer allocation/deallocation (0.1µs) and a stat check that the supplied fd refers to a directory (0.4µs).
So there is a one-time overhead of roughly 0.5µs, which is 10% of the directory scanning time for empty directories, but only about 1% for 100-item directories and negligible for larger ones. This overhead is 5 times smaller (only the allocation/deallocation cost) if you don't need fdopendir. (You only need fdopendir if you can't opendir directly and therefore must go through a separately obtained (e.g., openat'ed) file descriptor.)
So if you use a custom one-time allocated buffer along with getdents, you can save 2-10% on the scanning cost of empty directories and negligibly little on the bigger ones.
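For illustration, here is a rough sketch in C of the two approaches: counting directory entries through opendir/readdir, and counting them with getdents64 into a buffer supplied by the caller. The function names and the struct declaration are written for this example (glibc does not export the record layout; the fields follow the getdents(2) man page), and SYS_getdents64 is assumed to be available on the target architecture. Treat it as a sketch, not as the code used for the measurements above.

```c
#define _GNU_SOURCE
#include <dirent.h>
#include <fcntl.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Record layout used by getdents64(2); declared here because libc does not. */
struct linux_dirent64 {
    uint64_t       d_ino;
    int64_t        d_off;
    unsigned short d_reclen;
    unsigned char  d_type;
    char           d_name[];
};

/* Count entries (including "." and "..") via the POSIX API; -1 on error. */
long count_entries_readdir(const char *path)
{
    DIR *dir = opendir(path);
    if (dir == NULL)
        return -1;
    long n = 0;
    while (readdir(dir) != NULL)
        n++;
    closedir(dir);
    return n;
}

/* Count entries via getdents64 using a caller-owned buffer, so no
 * allocation happens per directory; -1 on error. */
long count_entries_getdents(const char *path, char *buf, size_t buflen)
{
    int fd = open(path, O_RDONLY | O_DIRECTORY | O_CLOEXEC);
    if (fd < 0)
        return -1;
    long n = 0;
    for (;;) {
        long got = syscall(SYS_getdents64, fd, buf, buflen);
        if (got < 0) {
            close(fd);
            return -1;
        }
        if (got == 0)               /* end of directory */
            break;
        for (long off = 0; off < got; n++) {
            unsigned short reclen;  /* copied out to avoid unaligned access */
            memcpy(&reclen,
                   buf + off + offsetof(struct linux_dirent64, d_reclen),
                   sizeof reclen);
            off += reclen;
        }
    }
    close(fd);
    return n;
}
```

A caller would typically declare one buffer (for example `static char buf[32768];`) and reuse it for every directory it scans. The getdents64 variant only builds on Linux, which is the portability cost discussed in the other answer.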
(The cost of PLT indirection on modern hardware is typically less than 1ns; function call overhead is about 1-2ns. Given that directory scanning times are on the order of microseconds, you'd need to make at least 1000 calls for these factors to add up to a single µs, but by then the scanning cost is 340µs and the accrued extra µs is about 0.3%: a negligible effect.)
answered 42 mins ago
PSkocik
POSIX does not specify any system calls, it specifies APIs
– fpmurphy
Jan 10 '18 at 2:14
@fpmurphy1 Pardon me for misspeaking. I meant POSIX-defined APIs not system calls, of course.
– Christopher Schultz
Jan 10 '18 at 21:08