Why does the first method take more than twice as long to create an array?
I thought creating the whole two-dimensional array directly would be quicker, but the version that allocates each row in an explicit loop takes only about half the time. What makes the direct allocation so much slower?
Here is the test code:

import java.util.concurrent.TimeUnit;

import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.BenchmarkMode;
import org.openjdk.jmh.annotations.Mode;
import org.openjdk.jmh.annotations.OutputTimeUnit;

@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.MICROSECONDS)
public class Test_newArray {
    private static int num = 10000;
    private static int length = 10;

    // Allocates the whole two-dimensional array in a single expression
    @Benchmark
    public static int[][] newArray() {
        return new int[num][length];
    }

    // Allocates the outer array first, then each row in an explicit loop
    @Benchmark
    public static int[][] newArray2() {
        int[][] temps = new int[num][];
        for (int i = 0; i < temps.length; i++) {
            temps[i] = new int[length];
        }
        return temps;
    }
}
The test results are as follows:

Benchmark                 Mode  Cnt    Score   Error  Units
Test_newArray.newArray    avgt   25  289.254 ± 4.982  us/op
Test_newArray.newArray2   avgt   25  114.364 ± 1.446  us/op
The test environment is as follows:
JMH version: 1.21
VM version: JDK 1.8.0_212, OpenJDK 64-Bit Server VM, 25.212-b04
Tags: java, performance

Asked 8 hours ago by user10339780 (new contributor)
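For completeness, a benchmark class like this is typically launched through the JMH runner. A minimal sketch of such a launcher, assuming the Test_newArray class above is on the classpath and the standard JMH runner API is available (the wrapper class name here is arbitrary):

import org.openjdk.jmh.runner.Runner;
import org.openjdk.jmh.runner.RunnerException;
import org.openjdk.jmh.runner.options.Options;
import org.openjdk.jmh.runner.options.OptionsBuilder;

public class RunTestNewArray {
    public static void main(String[] args) throws RunnerException {
        // Select only the benchmarks declared in Test_newArray and run them
        // with JMH's default forking, warmup and measurement settings.
        Options opt = new OptionsBuilder()
                .include(Test_newArray.class.getSimpleName())
                .build();
        new Runner(opt).run();
    }
}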
just guessing, maybe because in case of int[num][length], the continuous space of size num x length should be allocated, while in case of int[num][] the arrays are allocated arbitrarily – mangusta, 7 hours ago

Do you have any data on what happens in your environment when you vary num and length? – NPE, 7 hours ago

@mangusta There are no 2d arrays in Java, I believe your guess is wrong. stackoverflow.com/a/6631081/1570854 – Lesiak, 7 hours ago

If you run JMH with -prof perfasm, you might gain some helpful insights. E.g. I can see lots of ObjArrayKlass::multi_allocate present in the output of the first method, but absent in the second one. My guess: reflection overhead? – knittl, 7 hours ago

@Bohemian Doesn't JMH execute each benchmark in isolation (i.e. in its own JVM) and handle warmup for you? By default, I believe 5 forks are used per benchmark, where each fork runs 5 warmup iterations and 5 measurement iterations. – Slaw, 6 hours ago
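As a small illustration of Lesiak's point above: in Java, an int[][] is an array of references to separate int[] rows, not one contiguous two-dimensional block, so rows can be missing or have different lengths. A minimal sketch (a hypothetical demo class, not from the question):

public class JaggedArrayDemo {
    public static void main(String[] args) {
        int[][] rows = new int[3][];        // only the outer array of row references exists
        rows[0] = new int[10];              // each row is its own heap object...
        rows[2] = new int[2];               // ...and rows may have different lengths
        System.out.println(rows[1]);        // prints "null": row 1 was never allocated
        System.out.println(rows[0].length + " / " + rows[2].length);  // prints "10 / 2"
    }
}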
2 Answers
In Java there is a separate bytecode instruction for allocating multidimensional arrays - multianewarray.

The newArray benchmark uses the multianewarray bytecode; newArray2 invokes the simple newarray instruction in a loop.
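This is straightforward to verify with javap: running javap -c on the compiled benchmark class shows a single multianewarray instruction in newArray, while newArray2 contains an anewarray for the outer array and a newarray int inside the loop body. A rough sketch of the relevant lines (not verbatim output; constant-pool indices and offsets will differ):

  // newArray()
  multianewarray #2, 2       // class "[[I", both dimensions in one instruction

  // newArray2()
  anewarray      #3          // class "[I", outer array of row references
  ...
  newarray       int         // one row allocated per loop iteration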
The problem is that the HotSpot JVM has no fast path* for the multianewarray bytecode. This instruction is always executed in the VM runtime. Therefore, the allocation is not inlined in the compiled code.
The first benchmark has to pay the performance penalty of switching between the Java and VM-runtime contexts. Also, the common allocation code in the VM runtime (written in C++) is not as optimized as inlined allocation in JIT-compiled code, simply because it is generic: it is not specialized for the particular object type or the particular call site, it performs additional runtime checks, and so on.
Here are the results of profiling both benchmarks with async-profiler. I used JDK 11.0.4, but for JDK 8 the picture looks similar.

In the first case, 99% of the time is spent inside OptoRuntime::multianewarray2_C, the C++ code in the VM runtime.

In the second case, most of the graph is green, meaning that the program runs mostly in the Java context, actually executing JIT-compiled code optimized specifically for the given benchmark.
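For reference, flame graphs like the ones described here can be produced by attaching async-profiler to the running benchmark JVM; roughly along these lines, where the duration, output file name and PID are placeholders and the exact options depend on the async-profiler version:

  ./profiler.sh -e cpu -d 30 -f flame.svg <benchmark-jvm-pid>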
EDIT

* Just to clarify: in HotSpot, multianewarray is not optimized very well by design. It is rather costly to implement such a complex operation properly in both JIT compilers, while the benefit of such an optimization would be questionable: allocation of multidimensional arrays is rarely a performance bottleneck in a typical application.
Nice answer, especially including a flame graph – I wish they were more common. – kaan, 4 hours ago

Thank you for your answer; it answers my doubts well. This problem was just a spare-time amusement, so it's not really about optimizing the details. – user10339780, 2 hours ago
From the Oracle docs:

"It may be more efficient to use newarray or anewarray when creating an array of a single dimension."

The newArray2 benchmark allocates its rows with the simple newarray instruction (and the outer array with anewarray), while newArray uses the multianewarray instruction, which makes up the difference in the end.
With the perf Linux profiler I got the following results.

For the newArray benchmark, the hottest regions are:
....[Hottest Methods (after inlining)]..............................................................
22.58% libjvm.so MemAllocator::allocate
14.80% libjvm.so ObjArrayAllocator::initialize
12.92% libjvm.so TypeArrayKlass::multi_allocate
10.98% libjvm.so AccessInternal::PostRuntimeDispatch<G1BarrierSet::AccessBarrier<2670710ul, G1BarrierSet>, (AccessInternal::BarrierType)1, 2670710ul>::oop_access_barrier
7.38% libjvm.so ObjArrayKlass::multi_allocate
6.02% libjvm.so MemAllocator::Allocation::notify_allocation_jvmti_sampler
5.84% ld-2.27.so __tls_get_addr
5.66% libjvm.so CollectedHeap::array_allocate
5.39% libjvm.so Klass::check_array_allocation_length
4.76% libc-2.27.so __memset_avx2_unaligned_erms
0.75% libc-2.27.so __memset_avx2_erms
0.38% libjvm.so __tls_get_addr@plt
0.17% libjvm.so memset@plt
0.10% libjvm.so G1ParScanThreadState::copy_to_survivor_space
0.10% [kernel.kallsyms] update_blocked_averages
0.06% [kernel.kallsyms] native_write_msr
0.05% libjvm.so G1ParScanThreadState::trim_queue
0.05% libjvm.so Monitor::lock_without_safepoint_check
0.05% libjvm.so G1FreeCollectionSetTask::G1SerialFreeCollectionSetClosure::do_heap_region
0.05% libjvm.so OtherRegionsTable::occupied
1.92% <...other 288 warm methods...>
For the newArray2 benchmark:
....[Hottest Methods (after inlining)]..............................................................
93.45% perf-28023.map [unknown]
0.26% libjvm.so G1ParScanThreadState::copy_to_survivor_space
0.22% [kernel.kallsyms] update_blocked_averages
0.19% libjvm.so OtherRegionsTable::is_empty
0.17% libc-2.27.so __memset_avx2_erms
0.16% libc-2.27.so __memset_avx2_unaligned_erms
0.14% libjvm.so OptoRuntime::new_array_C
0.12% libjvm.so G1ParScanThreadState::trim_queue
0.11% libjvm.so G1FreeCollectionSetTask::G1SerialFreeCollectionSetClosure::do_heap_region
0.11% libjvm.so MemAllocator::allocate_inside_tlab_slow
0.11% libjvm.so ObjArrayAllocator::initialize
0.10% libjvm.so OtherRegionsTable::occupied
0.10% libjvm.so MemAllocator::allocate
0.10% libjvm.so Monitor::lock_without_safepoint_check
0.10% [kernel.kallsyms] rt2800pci_rxdone_tasklet
0.09% libjvm.so G1Allocator::unsafe_max_tlab_alloc
0.08% libjvm.so ThreadLocalAllocBuffer::fill
0.08% ld-2.27.so __tls_get_addr
0.07% libjvm.so G1CollectedHeap::allocate_new_tlab
0.07% libjvm.so TypeArrayKlass::allocate_common
4.15% <...other 411 warm methods...>
As we can see, for the slower newArray benchmark most of the time is spent in:
MemAllocator::allocate
ObjArrayAllocator::initialize
TypeArrayKlass::multi_allocate
ObjArrayKlass::multi_allocate
...
The newArray2 benchmark, on the other hand, goes through OptoRuntime::new_array_C and spends a lot less time allocating the memory for the arrays.
As a result, note the difference in the number of cycles and instructions:
Benchmark Mode Cnt Score Error Units
newArray avgt 4 448.018 ± 80.029 us/op
newArray:CPI avgt 0.359 #/op
newArray:L1-dcache-load-misses avgt 10399.712 #/op
newArray:L1-dcache-loads avgt 1032985.924 #/op
newArray:L1-dcache-stores avgt 590756.905 #/op
newArray:cycles avgt 1132753.204 #/op
newArray:instructions avgt 3159465.006 #/op
newArray2 avgt 4 125.531 ± 50.749 us/op
newArray2:CPI avgt 0.532 #/op
newArray2:L1-dcache-load-misses avgt 10345.720 #/op
newArray2:L1-dcache-loads avgt 85185.726 #/op
newArray2:L1-dcache-stores avgt 103096.223 #/op
newArray2:cycles avgt 346651.432 #/op
newArray2:instructions avgt 652155.439 #/op
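The "[Hottest Methods (after inlining)]" listings above are in the format printed by JMH's perfasm profiler, and the per-operation counters (CPI, cycles, instructions, L1-dcache events) match what the perfnorm profiler reports. Runs producing this kind of output would be launched roughly as follows; the jar name is a placeholder for however the benchmarks are packaged:

  # normalized hardware counters per benchmark operation
  java -jar benchmarks.jar Test_newArray -prof perfnorm

  # hottest code regions with annotated assembly (as suggested in the comments)
  java -jar benchmarks.jar Test_newArray -prof perfasm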
it would be nice if you elaborated on these results because I didn't get anything ;) – Andrew Tobilko, 7 hours ago

Thank you for the detailed data; that lets me see clearly why. – user10339780, 2 hours ago