Why does the first method take more than twice as long to create an array?Java: A two dimensional array is...

Why didn't Snape ask Dumbledore why he let "Moody" search his office?

Why didn't Trudy wear a breathing mask in Avatar?

What does the whole letter in Black Panther by Prince NJobu say?

difference between $HOME and ~

How to realize Poles and zeros at infinity??especially through transfer function?

How is the speed of nucleons in the nucleus measured?

Minimum perfect squares needed to sum up to a target

Making a animation of multiple 3D objects rotating

Find the number for the question mark

The locus of polynomials with specified root multiplicities

Spiral Stumper Series: Instructionless Puzzle

I'm made of obsolete parts

How slow was the 6502 BASIC compared to Assembly

"To Verb a Noun"

AC/DC 100A clamp meter

Coffee Grounds and Gritty Butter Cream Icing

Why is the time of useful consciousness only seconds at high altitudes, when I can hold my breath much longer at ground level?

How to accompany with piano in latin music when given only chords?

How to explain that the sums of numerators over sums of denominators isn't the same as the mean of ratios?

Is negative resistance possible?

Can't change the screenshot directory because "com.apple.screencapture: command not found"

In what sense is SL(2,q) "very far from abelian"?

In 1700s, why was 'books that never read' grammatical?

Is there any problem with students seeing faculty naked in university gym?

Why does the first method take more than twice as long to create an array?

Java: A two dimensional array is stored in column-major or row-major order?Why does C++ compilation take so long?How to avoid frequent creation of objects in a loop in java?Why does the order of the loops affect performance when iterating over a 2D array?Why is processing a sorted array faster than processing an unsorted array?Why is processing a sorted array slower than an unsorted array?java immutable class much slowerSpray and akka-http throughput decreased significantly on response length changeWhy does array[idx++]+=“a” increase idx once in Java 8 but twice in Java 9 and 10?How to implement parcelable with my custom class containing Hashmap and SparseArray?

.everyoneloves__top-leaderboard:empty,.everyoneloves__mid-leaderboard:empty,.everyoneloves__bot-mid-leaderboard:empty{
margin-bottom:0;
}

I thought it would be quicker to create directly, but in fact, adding loops takes only half the time. What happened that slowed down so much?

Here is the test code

@BenchmarkMode(Mode.AverageTime)

@OutputTimeUnit(TimeUnit.MICROSECONDS)

public class Test_newArray {

    private static int num = 10000;

    private static int length = 10;



    @Benchmark

    public static int[][] newArray() {

        return new int[num][length];

    }



    @Benchmark

    public static int[][] newArray2() {

        int[][] temps = new int[num][];

        for (int i = 0; i < temps.length; i++) {

            temps[i] = new int[length];

        }

        return temps;

    }



}

The test results are as follows.

Benchmark                Mode  Cnt    Score   Error  Units

Test_newArray.newArray   avgt   25  289.254 ± 4.982  us/op

Test_newArray.newArray2  avgt   25  114.364 ± 1.446  us/op

The test environment is as follows

JMH version: 1.21

VM version: JDK 1.8.0_212, OpenJDK 64-Bit Server VM, 25.212-b04

asked 8 hours ago

user10339780

664 bronze badges

New contributor

1

just guessing, maybe because in case of int[num][length], the continuous space of size num x length should be allocated while in case of int[num][], the arrays are allocated arbitrarily

– mangusta
7 hours ago

Do you have any data on what happens in your environment when you vary num and length?

– NPE
7 hours ago

@mangusta There are no 2d arrays in Java, I believe your guess is wrong. stackoverflow.com/a/6631081/1570854

– Lesiak
7 hours ago

1

If you run JMH with -prof perfasm, you might gain some helpful insights. E.g. I can see lots of ObjArrayKlass::multi_allocate present in the output of the first method, but absent in the second one. My guess: reflection overhead?

– knittl
7 hours ago

1

@Bohemian Doesn't JMH execute each benchmark in isolation (i.e. in its own JVM) and handle warmup for you? By default, I believe 5 forks are used per benchmark, where each fork runs 5 warmup iterations and 5 measurement iterations.

– Slaw
6 hours ago

|
show 4 more comments

I thought it would be quicker to create directly, but in fact, adding loops takes only half the time. What happened that slowed down so much?

Here is the test code

@BenchmarkMode(Mode.AverageTime)

@OutputTimeUnit(TimeUnit.MICROSECONDS)

public class Test_newArray {

    private static int num = 10000;

    private static int length = 10;



    @Benchmark

    public static int[][] newArray() {

        return new int[num][length];

    }



    @Benchmark

    public static int[][] newArray2() {

        int[][] temps = new int[num][];

        for (int i = 0; i < temps.length; i++) {

            temps[i] = new int[length];

        }

        return temps;

    }



}

The test results are as follows.

Benchmark                Mode  Cnt    Score   Error  Units

Test_newArray.newArray   avgt   25  289.254 ± 4.982  us/op

Test_newArray.newArray2  avgt   25  114.364 ± 1.446  us/op

The test environment is as follows

JMH version: 1.21

VM version: JDK 1.8.0_212, OpenJDK 64-Bit Server VM, 25.212-b04

asked 8 hours ago

user10339780

664 bronze badges

New contributor

1

just guessing, maybe because in case of int[num][length], the continuous space of size num x length should be allocated while in case of int[num][], the arrays are allocated arbitrarily

– mangusta
7 hours ago

Do you have any data on what happens in your environment when you vary num and length?

– NPE
7 hours ago

@mangusta There are no 2d arrays in Java, I believe your guess is wrong. stackoverflow.com/a/6631081/1570854

– Lesiak
7 hours ago

1

If you run JMH with -prof perfasm, you might gain some helpful insights. E.g. I can see lots of ObjArrayKlass::multi_allocate present in the output of the first method, but absent in the second one. My guess: reflection overhead?

– knittl
7 hours ago

1

@Bohemian Doesn't JMH execute each benchmark in isolation (i.e. in its own JVM) and handle warmup for you? By default, I believe 5 forks are used per benchmark, where each fork runs 5 warmup iterations and 5 measurement iterations.

– Slaw
6 hours ago

|
show 4 more comments

I thought it would be quicker to create directly, but in fact, adding loops takes only half the time. What happened that slowed down so much?

Here is the test code

@BenchmarkMode(Mode.AverageTime)

@OutputTimeUnit(TimeUnit.MICROSECONDS)

public class Test_newArray {

    private static int num = 10000;

    private static int length = 10;



    @Benchmark

    public static int[][] newArray() {

        return new int[num][length];

    }



    @Benchmark

    public static int[][] newArray2() {

        int[][] temps = new int[num][];

        for (int i = 0; i < temps.length; i++) {

            temps[i] = new int[length];

        }

        return temps;

    }



}

The test results are as follows.

Benchmark                Mode  Cnt    Score   Error  Units

Test_newArray.newArray   avgt   25  289.254 ± 4.982  us/op

Test_newArray.newArray2  avgt   25  114.364 ± 1.446  us/op

The test environment is as follows

JMH version: 1.21

VM version: JDK 1.8.0_212, OpenJDK 64-Bit Server VM, 25.212-b04

asked 8 hours ago

user10339780

664 bronze badges

New contributor

I thought it would be quicker to create directly, but in fact, adding loops takes only half the time. What happened that slowed down so much?

Here is the test code

@BenchmarkMode(Mode.AverageTime)

@OutputTimeUnit(TimeUnit.MICROSECONDS)

public class Test_newArray {

    private static int num = 10000;

    private static int length = 10;



    @Benchmark

    public static int[][] newArray() {

        return new int[num][length];

    }



    @Benchmark

    public static int[][] newArray2() {

        int[][] temps = new int[num][];

        for (int i = 0; i < temps.length; i++) {

            temps[i] = new int[length];

        }

        return temps;

    }



}

The test results are as follows.

Benchmark                Mode  Cnt    Score   Error  Units

Test_newArray.newArray   avgt   25  289.254 ± 4.982  us/op

Test_newArray.newArray2  avgt   25  114.364 ± 1.446  us/op

The test environment is as follows

JMH version: 1.21

VM version: JDK 1.8.0_212, OpenJDK 64-Bit Server VM, 25.212-b04

java performance

asked 8 hours ago

user10339780

664 bronze badges

New contributor

asked 8 hours ago

user10339780

664 bronze badges

New contributor

asked 8 hours ago

user10339780

664 bronze badges

New contributor

asked 8 hours ago

user10339780

664 bronze badges

asked 8 hours ago

user10339780

664 bronze badges

New contributor

1

just guessing, maybe because in case of int[num][length], the continuous space of size num x length should be allocated while in case of int[num][], the arrays are allocated arbitrarily

– mangusta
7 hours ago

Do you have any data on what happens in your environment when you vary num and length?

– NPE
7 hours ago

@mangusta There are no 2d arrays in Java, I believe your guess is wrong. stackoverflow.com/a/6631081/1570854

– Lesiak
7 hours ago

1

If you run JMH with -prof perfasm, you might gain some helpful insights. E.g. I can see lots of ObjArrayKlass::multi_allocate present in the output of the first method, but absent in the second one. My guess: reflection overhead?

– knittl
7 hours ago

1

@Bohemian Doesn't JMH execute each benchmark in isolation (i.e. in its own JVM) and handle warmup for you? By default, I believe 5 forks are used per benchmark, where each fork runs 5 warmup iterations and 5 measurement iterations.

– Slaw
6 hours ago

|
show 4 more comments

1

just guessing, maybe because in case of int[num][length], the continuous space of size num x length should be allocated while in case of int[num][], the arrays are allocated arbitrarily

– mangusta
7 hours ago

Do you have any data on what happens in your environment when you vary num and length?

– NPE
7 hours ago

@mangusta There are no 2d arrays in Java, I believe your guess is wrong. stackoverflow.com/a/6631081/1570854

– Lesiak
7 hours ago

1

If you run JMH with -prof perfasm, you might gain some helpful insights. E.g. I can see lots of ObjArrayKlass::multi_allocate present in the output of the first method, but absent in the second one. My guess: reflection overhead?

– knittl
7 hours ago

1

@Bohemian Doesn't JMH execute each benchmark in isolation (i.e. in its own JVM) and handle warmup for you? By default, I believe 5 forks are used per benchmark, where each fork runs 5 warmup iterations and 5 measurement iterations.

– Slaw
6 hours ago

just guessing, maybe because in case of int[num][length], the continuous space of size num x length should be allocated while in case of int[num][], the arrays are allocated arbitrarily

– mangusta
7 hours ago

Do you have any data on what happens in your environment when you vary num and length?

– NPE
7 hours ago

@mangusta There are no 2d arrays in Java, I believe your guess is wrong. stackoverflow.com/a/6631081/1570854

– Lesiak
7 hours ago

If you run JMH with -prof perfasm, you might gain some helpful insights. E.g. I can see lots of ObjArrayKlass::multi_allocate present in the output of the first method, but absent in the second one. My guess: reflection overhead?

– knittl
7 hours ago

@Bohemian Doesn't JMH execute each benchmark in isolation (i.e. in its own JVM) and handle warmup for you? By default, I believe 5 forks are used per benchmark, where each fork runs 5 warmup iterations and 5 measurement iterations.

– Slaw
6 hours ago

|
show 4 more comments

2 Answers
2

active

oldest

votes

In Java there is a separate bytecode instruction for allocating multidimensional arrays - multianewarray.

newArray benchmark uses multianewarray bytecode;

newArray2 invokes simple newarray in the loop.

The problem is that HotSpot JVM has no fast path^* for multianewarray bytecode. This instruction is always executed in VM runtime. Therefore, the allocation is not inlined in the compiled code.

The first benchmark has to pay performance penalty of switching between Java and VM Runtime contexts. Also, the common allocation code in the VM runtime (written in C++) is not as optimized as inlined allocation in JIT-compiled code, just because it is generic, i.e. not optimized for the particular object type or for the particular call site, it performs additional runtime checks, etc.

Here are the results of profiling both benchmarks with async-profiler. I used JDK 11.0.4, but for JDK 8 the picture looks similar.

newArray

newArray2

In the first case, 99% time is spent inside OptoRuntime::multianewarray2_C - the C++ code in the VM runtime.

In the second case, the most of the graph is green, meaning that the program runs mostly in Java context, actually executing JIT-compiled code optimized specifically for the given benchmark.

EDIT

^* Just to clarify: in HotSpot multianewarray is not optimized very well by design. It is rather costly to implement such a complex operation in both JIT compilers properly, while the benefits of such optimization would be questionable: allocation of multidimentional arrays is rarely a performance bottleneck in a typical application.

edited 5 hours ago

answered 5 hours ago

apangin

57.8k8 gold badges115 silver badges147 bronze badges

Nice answer, especially including a flame graph – I wish they were more common.

– kaan
4 hours ago

Thank you for your answer. It's a good answer to my doubts. The problem is the entertainment of rest time. It's not really about optimizing the details.

– user10339780
2 hours ago

add a comment
|

From the Oracle Docs:

It may be more efficient to use newarray or anewarray when creating an array of a single dimension.

The newArray benchmark uses anewarray bytecode instruction, while newArray2 - multianewarray bytecode instruction, which makes up a difference in the end.

With the perf Linux profiler I got the following results.

For the newArray benchmark, the hottest regions are:

....[Hottest Methods (after inlining)]..............................................................

 22.58%           libjvm.so  MemAllocator::allocate

 14.80%           libjvm.so  ObjArrayAllocator::initialize

 12.92%           libjvm.so  TypeArrayKlass::multi_allocate

 10.98%           libjvm.so  AccessInternal::PostRuntimeDispatch<G1BarrierSet::AccessBarrier<2670710ul, G1BarrierSet>, (AccessInternal::BarrierType)1, 2670710ul>::oop_access_barrier

  7.38%           libjvm.so  ObjArrayKlass::multi_allocate

  6.02%           libjvm.so  MemAllocator::Allocation::notify_allocation_jvmti_sampler

  5.84%          ld-2.27.so  __tls_get_addr

  5.66%           libjvm.so  CollectedHeap::array_allocate

  5.39%           libjvm.so  Klass::check_array_allocation_length

  4.76%        libc-2.27.so  __memset_avx2_unaligned_erms

  0.75%        libc-2.27.so  __memset_avx2_erms

  0.38%           libjvm.so  __tls_get_addr@plt

  0.17%           libjvm.so  memset@plt

  0.10%           libjvm.so  G1ParScanThreadState::copy_to_survivor_space

  0.10%   [kernel.kallsyms]  update_blocked_averages

  0.06%   [kernel.kallsyms]  native_write_msr

  0.05%           libjvm.so  G1ParScanThreadState::trim_queue

  0.05%           libjvm.so  Monitor::lock_without_safepoint_check

  0.05%           libjvm.so  G1FreeCollectionSetTask::G1SerialFreeCollectionSetClosure::do_heap_region

  0.05%           libjvm.so  OtherRegionsTable::occupied

  1.92%  <...other 288 warm methods...>

For the newArray2:

....[Hottest Methods (after inlining)]..............................................................

 93.45%      perf-28023.map  [unknown]

  0.26%           libjvm.so  G1ParScanThreadState::copy_to_survivor_space

  0.22%   [kernel.kallsyms]  update_blocked_averages

  0.19%           libjvm.so  OtherRegionsTable::is_empty

  0.17%        libc-2.27.so  __memset_avx2_erms

  0.16%        libc-2.27.so  __memset_avx2_unaligned_erms

  0.14%           libjvm.so  OptoRuntime::new_array_C

  0.12%           libjvm.so  G1ParScanThreadState::trim_queue

  0.11%           libjvm.so  G1FreeCollectionSetTask::G1SerialFreeCollectionSetClosure::do_heap_region

  0.11%           libjvm.so  MemAllocator::allocate_inside_tlab_slow

  0.11%           libjvm.so  ObjArrayAllocator::initialize

  0.10%           libjvm.so  OtherRegionsTable::occupied

  0.10%           libjvm.so  MemAllocator::allocate

  0.10%           libjvm.so  Monitor::lock_without_safepoint_check

  0.10%   [kernel.kallsyms]  rt2800pci_rxdone_tasklet

  0.09%           libjvm.so  G1Allocator::unsafe_max_tlab_alloc

  0.08%           libjvm.so  ThreadLocalAllocBuffer::fill

  0.08%          ld-2.27.so  __tls_get_addr

  0.07%           libjvm.so  G1CollectedHeap::allocate_new_tlab

  0.07%           libjvm.so  TypeArrayKlass::allocate_common

  4.15%  <...other 411 warm methods...>

As we can see, for the slower newArray benchmark most of the time is spent in:

MemAllocator::allocate

ObjArrayAllocator::initialize

TypeArrayKlass::multi_allocate

ObjArrayKlass::multi_allocate

...

While the newArray2 benchmark uses the OptoRuntime::new_array_C, spending a lot less time allocating the memory for arrays.

As a result, note the difference in the number of cycles and instructions:

Benchmark                        Mode  Cnt        Score    Error  Units

newArray                         avgt    4      448.018 ± 80.029  us/op

newArray:CPI                     avgt             0.359            #/op

newArray:L1-dcache-load-misses   avgt         10399.712            #/op

newArray:L1-dcache-loads         avgt       1032985.924            #/op

newArray:L1-dcache-stores        avgt        590756.905            #/op

newArray:cycles                  avgt       1132753.204            #/op

newArray:instructions            avgt       3159465.006            #/op

newArray2                        avgt    4      125.531 ± 50.749  us/op

newArray2:CPI                    avgt             0.532            #/op

newArray2:L1-dcache-load-misses  avgt         10345.720            #/op

newArray2:L1-dcache-loads        avgt         85185.726            #/op

newArray2:L1-dcache-stores       avgt        103096.223            #/op

newArray2:cycles                 avgt        346651.432            #/op

newArray2:instructions           avgt        652155.439            #/op

edited 5 hours ago

answered 7 hours ago

Oleksandr Pyrohov

11.2k4 gold badges45 silver badges76 bronze badges

4

it would be nice if you elaborated on these results because I didn't get anything ;)

– Andrew Tobilko
7 hours ago

Thank you for the detailed data, Let me see clearly why.

– user10339780
2 hours ago

add a comment
|

Your Answer

StackExchange.ifUsing("editor", function () {
StackExchange.using("externalEditor", function () {
StackExchange.using("snippets", function () {
StackExchange.snippets.init();
});
});
}, "code-snippets");

StackExchange.ready(function() {
var channelOptions = {
tags: "".split(" "),
id: "1"
};
initTagRenderer("".split(" "), "".split(" "), channelOptions);

StackExchange.using("externalEditor", function() {
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled) {
StackExchange.using("snippets", function() {
createEditor();
});
}
else {
createEditor();
}
});

function createEditor() {
StackExchange.prepareEditor({
heartbeatType: 'answer',
autoActivateHeartbeat: false,
convertImagesToLinks: true,
noModals: true,
showLowRepImageUploadWarning: true,
reputationToPostImages: 10,
bindNavPrevention: true,
postfix: "",
imageUploader: {
brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/4.0/"u003ecc by-sa 4.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
allowUrls: true
},
onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
});

}
});

user10339780 is a new contributor. Be nice, and check out our Code of Conduct.

draft saved

draft discarded

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f58158445%2fwhy-does-the-first-method-take-more-than-twice-as-long-to-create-an-array%23new-answer', 'question_page');
}
);

Post as a guest

Name

Required, but never shown

2 Answers
2

active

oldest

votes

2 Answers
2

active

oldest

votes

In Java there is a separate bytecode instruction for allocating multidimensional arrays - multianewarray.

newArray benchmark uses multianewarray bytecode;

newArray2 invokes simple newarray in the loop.

The problem is that HotSpot JVM has no fast path^* for multianewarray bytecode. This instruction is always executed in VM runtime. Therefore, the allocation is not inlined in the compiled code.

Here are the results of profiling both benchmarks with async-profiler. I used JDK 11.0.4, but for JDK 8 the picture looks similar.

newArray

newArray2

In the first case, 99% time is spent inside OptoRuntime::multianewarray2_C - the C++ code in the VM runtime.

In the second case, the most of the graph is green, meaning that the program runs mostly in Java context, actually executing JIT-compiled code optimized specifically for the given benchmark.

EDIT

edited 5 hours ago

answered 5 hours ago

apangin

57.8k8 gold badges115 silver badges147 bronze badges

Nice answer, especially including a flame graph – I wish they were more common.

– kaan
4 hours ago

Thank you for your answer. It's a good answer to my doubts. The problem is the entertainment of rest time. It's not really about optimizing the details.

– user10339780
2 hours ago

add a comment
|

In Java there is a separate bytecode instruction for allocating multidimensional arrays - multianewarray.

newArray benchmark uses multianewarray bytecode;

newArray2 invokes simple newarray in the loop.

The problem is that HotSpot JVM has no fast path^* for multianewarray bytecode. This instruction is always executed in VM runtime. Therefore, the allocation is not inlined in the compiled code.

Here are the results of profiling both benchmarks with async-profiler. I used JDK 11.0.4, but for JDK 8 the picture looks similar.

newArray

newArray2

In the first case, 99% time is spent inside OptoRuntime::multianewarray2_C - the C++ code in the VM runtime.

In the second case, the most of the graph is green, meaning that the program runs mostly in Java context, actually executing JIT-compiled code optimized specifically for the given benchmark.

EDIT

edited 5 hours ago

answered 5 hours ago

apangin

57.8k8 gold badges115 silver badges147 bronze badges

Nice answer, especially including a flame graph – I wish they were more common.

– kaan
4 hours ago

Thank you for your answer. It's a good answer to my doubts. The problem is the entertainment of rest time. It's not really about optimizing the details.

– user10339780
2 hours ago

add a comment
|

In Java there is a separate bytecode instruction for allocating multidimensional arrays - multianewarray.

newArray benchmark uses multianewarray bytecode;

newArray2 invokes simple newarray in the loop.

The problem is that HotSpot JVM has no fast path^* for multianewarray bytecode. This instruction is always executed in VM runtime. Therefore, the allocation is not inlined in the compiled code.

Here are the results of profiling both benchmarks with async-profiler. I used JDK 11.0.4, but for JDK 8 the picture looks similar.

newArray

newArray2

In the first case, 99% time is spent inside OptoRuntime::multianewarray2_C - the C++ code in the VM runtime.

In the second case, the most of the graph is green, meaning that the program runs mostly in Java context, actually executing JIT-compiled code optimized specifically for the given benchmark.

EDIT

edited 5 hours ago

answered 5 hours ago

apangin

57.8k8 gold badges115 silver badges147 bronze badges

In Java there is a separate bytecode instruction for allocating multidimensional arrays - multianewarray.

newArray benchmark uses multianewarray bytecode;

newArray2 invokes simple newarray in the loop.

The problem is that HotSpot JVM has no fast path^* for multianewarray bytecode. This instruction is always executed in VM runtime. Therefore, the allocation is not inlined in the compiled code.

Here are the results of profiling both benchmarks with async-profiler. I used JDK 11.0.4, but for JDK 8 the picture looks similar.

newArray

newArray2

In the first case, 99% time is spent inside OptoRuntime::multianewarray2_C - the C++ code in the VM runtime.

In the second case, the most of the graph is green, meaning that the program runs mostly in Java context, actually executing JIT-compiled code optimized specifically for the given benchmark.

EDIT

edited 5 hours ago

answered 5 hours ago

apangin

57.8k8 gold badges115 silver badges147 bronze badges

edited 5 hours ago

answered 5 hours ago

apangin

57.8k8 gold badges115 silver badges147 bronze badges

answered 5 hours ago

apangin

57.8k8 gold badges115 silver badges147 bronze badges

answered 5 hours ago

apangin

57.8k8 gold badges115 silver badges147 bronze badges

Nice answer, especially including a flame graph – I wish they were more common.

– kaan
4 hours ago

Thank you for your answer. It's a good answer to my doubts. The problem is the entertainment of rest time. It's not really about optimizing the details.

– user10339780
2 hours ago

add a comment
|

Nice answer, especially including a flame graph – I wish they were more common.

– kaan
4 hours ago

Thank you for your answer. It's a good answer to my doubts. The problem is the entertainment of rest time. It's not really about optimizing the details.

– user10339780
2 hours ago

Nice answer, especially including a flame graph – I wish they were more common.

– kaan
4 hours ago

Thank you for your answer. It's a good answer to my doubts. The problem is the entertainment of rest time. It's not really about optimizing the details.

– user10339780
2 hours ago

add a comment
|

From the Oracle Docs:

It may be more efficient to use newarray or anewarray when creating an array of a single dimension.

The newArray benchmark uses anewarray bytecode instruction, while newArray2 - multianewarray bytecode instruction, which makes up a difference in the end.

With the perf Linux profiler I got the following results.

For the newArray benchmark, the hottest regions are:

....[Hottest Methods (after inlining)]..............................................................

 22.58%           libjvm.so  MemAllocator::allocate

 14.80%           libjvm.so  ObjArrayAllocator::initialize

 12.92%           libjvm.so  TypeArrayKlass::multi_allocate

 10.98%           libjvm.so  AccessInternal::PostRuntimeDispatch<G1BarrierSet::AccessBarrier<2670710ul, G1BarrierSet>, (AccessInternal::BarrierType)1, 2670710ul>::oop_access_barrier

  7.38%           libjvm.so  ObjArrayKlass::multi_allocate

  6.02%           libjvm.so  MemAllocator::Allocation::notify_allocation_jvmti_sampler

  5.84%          ld-2.27.so  __tls_get_addr

  5.66%           libjvm.so  CollectedHeap::array_allocate

  5.39%           libjvm.so  Klass::check_array_allocation_length

  4.76%        libc-2.27.so  __memset_avx2_unaligned_erms

  0.75%        libc-2.27.so  __memset_avx2_erms

  0.38%           libjvm.so  __tls_get_addr@plt

  0.17%           libjvm.so  memset@plt

  0.10%           libjvm.so  G1ParScanThreadState::copy_to_survivor_space

  0.10%   [kernel.kallsyms]  update_blocked_averages

  0.06%   [kernel.kallsyms]  native_write_msr

  0.05%           libjvm.so  G1ParScanThreadState::trim_queue

  0.05%           libjvm.so  Monitor::lock_without_safepoint_check

  0.05%           libjvm.so  G1FreeCollectionSetTask::G1SerialFreeCollectionSetClosure::do_heap_region

  0.05%           libjvm.so  OtherRegionsTable::occupied

  1.92%  <...other 288 warm methods...>

For the newArray2:

....[Hottest Methods (after inlining)]..............................................................

 93.45%      perf-28023.map  [unknown]

  0.26%           libjvm.so  G1ParScanThreadState::copy_to_survivor_space

  0.22%   [kernel.kallsyms]  update_blocked_averages

  0.19%           libjvm.so  OtherRegionsTable::is_empty

  0.17%        libc-2.27.so  __memset_avx2_erms

  0.16%        libc-2.27.so  __memset_avx2_unaligned_erms

  0.14%           libjvm.so  OptoRuntime::new_array_C

  0.12%           libjvm.so  G1ParScanThreadState::trim_queue

  0.11%           libjvm.so  G1FreeCollectionSetTask::G1SerialFreeCollectionSetClosure::do_heap_region

  0.11%           libjvm.so  MemAllocator::allocate_inside_tlab_slow

  0.11%           libjvm.so  ObjArrayAllocator::initialize

  0.10%           libjvm.so  OtherRegionsTable::occupied

  0.10%           libjvm.so  MemAllocator::allocate

  0.10%           libjvm.so  Monitor::lock_without_safepoint_check

  0.10%   [kernel.kallsyms]  rt2800pci_rxdone_tasklet

  0.09%           libjvm.so  G1Allocator::unsafe_max_tlab_alloc

  0.08%           libjvm.so  ThreadLocalAllocBuffer::fill

  0.08%          ld-2.27.so  __tls_get_addr

  0.07%           libjvm.so  G1CollectedHeap::allocate_new_tlab

  0.07%           libjvm.so  TypeArrayKlass::allocate_common

  4.15%  <...other 411 warm methods...>

As we can see, for the slower newArray benchmark most of the time is spent in:

MemAllocator::allocate

ObjArrayAllocator::initialize

TypeArrayKlass::multi_allocate

ObjArrayKlass::multi_allocate

...

While the newArray2 benchmark uses the OptoRuntime::new_array_C, spending a lot less time allocating the memory for arrays.

As a result, note the difference in the number of cycles and instructions:

Benchmark                        Mode  Cnt        Score    Error  Units

newArray                         avgt    4      448.018 ± 80.029  us/op

newArray:CPI                     avgt             0.359            #/op

newArray:L1-dcache-load-misses   avgt         10399.712            #/op

newArray:L1-dcache-loads         avgt       1032985.924            #/op

newArray:L1-dcache-stores        avgt        590756.905            #/op

newArray:cycles                  avgt       1132753.204            #/op

newArray:instructions            avgt       3159465.006            #/op

newArray2                        avgt    4      125.531 ± 50.749  us/op

newArray2:CPI                    avgt             0.532            #/op

newArray2:L1-dcache-load-misses  avgt         10345.720            #/op

newArray2:L1-dcache-loads        avgt         85185.726            #/op

newArray2:L1-dcache-stores       avgt        103096.223            #/op

newArray2:cycles                 avgt        346651.432            #/op

newArray2:instructions           avgt        652155.439            #/op

edited 5 hours ago

answered 7 hours ago

Oleksandr Pyrohov

11.2k4 gold badges45 silver badges76 bronze badges

4

it would be nice if you elaborated on these results because I didn't get anything ;)

– Andrew Tobilko
7 hours ago

Thank you for the detailed data, Let me see clearly why.

– user10339780
2 hours ago

add a comment
|

From the Oracle Docs:

It may be more efficient to use newarray or anewarray when creating an array of a single dimension.

The newArray benchmark uses anewarray bytecode instruction, while newArray2 - multianewarray bytecode instruction, which makes up a difference in the end.

With the perf Linux profiler I got the following results.

For the newArray benchmark, the hottest regions are:

....[Hottest Methods (after inlining)]..............................................................

 22.58%           libjvm.so  MemAllocator::allocate

 14.80%           libjvm.so  ObjArrayAllocator::initialize

 12.92%           libjvm.so  TypeArrayKlass::multi_allocate

 10.98%           libjvm.so  AccessInternal::PostRuntimeDispatch<G1BarrierSet::AccessBarrier<2670710ul, G1BarrierSet>, (AccessInternal::BarrierType)1, 2670710ul>::oop_access_barrier

  7.38%           libjvm.so  ObjArrayKlass::multi_allocate

  6.02%           libjvm.so  MemAllocator::Allocation::notify_allocation_jvmti_sampler

  5.84%          ld-2.27.so  __tls_get_addr

  5.66%           libjvm.so  CollectedHeap::array_allocate

  5.39%           libjvm.so  Klass::check_array_allocation_length

  4.76%        libc-2.27.so  __memset_avx2_unaligned_erms

  0.75%        libc-2.27.so  __memset_avx2_erms

  0.38%           libjvm.so  __tls_get_addr@plt

  0.17%           libjvm.so  memset@plt

  0.10%           libjvm.so  G1ParScanThreadState::copy_to_survivor_space

  0.10%   [kernel.kallsyms]  update_blocked_averages

  0.06%   [kernel.kallsyms]  native_write_msr

  0.05%           libjvm.so  G1ParScanThreadState::trim_queue

  0.05%           libjvm.so  Monitor::lock_without_safepoint_check

  0.05%           libjvm.so  G1FreeCollectionSetTask::G1SerialFreeCollectionSetClosure::do_heap_region

  0.05%           libjvm.so  OtherRegionsTable::occupied

  1.92%  <...other 288 warm methods...>

For the newArray2:

....[Hottest Methods (after inlining)]..............................................................

 93.45%      perf-28023.map  [unknown]

  0.26%           libjvm.so  G1ParScanThreadState::copy_to_survivor_space

  0.22%   [kernel.kallsyms]  update_blocked_averages

  0.19%           libjvm.so  OtherRegionsTable::is_empty

  0.17%        libc-2.27.so  __memset_avx2_erms

  0.16%        libc-2.27.so  __memset_avx2_unaligned_erms

  0.14%           libjvm.so  OptoRuntime::new_array_C

  0.12%           libjvm.so  G1ParScanThreadState::trim_queue

  0.11%           libjvm.so  G1FreeCollectionSetTask::G1SerialFreeCollectionSetClosure::do_heap_region

  0.11%           libjvm.so  MemAllocator::allocate_inside_tlab_slow

  0.11%           libjvm.so  ObjArrayAllocator::initialize

  0.10%           libjvm.so  OtherRegionsTable::occupied

  0.10%           libjvm.so  MemAllocator::allocate

  0.10%           libjvm.so  Monitor::lock_without_safepoint_check

  0.10%   [kernel.kallsyms]  rt2800pci_rxdone_tasklet

  0.09%           libjvm.so  G1Allocator::unsafe_max_tlab_alloc

  0.08%           libjvm.so  ThreadLocalAllocBuffer::fill

  0.08%          ld-2.27.so  __tls_get_addr

  0.07%           libjvm.so  G1CollectedHeap::allocate_new_tlab

  0.07%           libjvm.so  TypeArrayKlass::allocate_common

  4.15%  <...other 411 warm methods...>

As we can see, for the slower newArray benchmark most of the time is spent in:

MemAllocator::allocate

ObjArrayAllocator::initialize

TypeArrayKlass::multi_allocate

ObjArrayKlass::multi_allocate

...

While the newArray2 benchmark uses the OptoRuntime::new_array_C, spending a lot less time allocating the memory for arrays.

As a result, note the difference in the number of cycles and instructions:

Benchmark                        Mode  Cnt        Score    Error  Units

newArray                         avgt    4      448.018 ± 80.029  us/op

newArray:CPI                     avgt             0.359            #/op

newArray:L1-dcache-load-misses   avgt         10399.712            #/op

newArray:L1-dcache-loads         avgt       1032985.924            #/op

newArray:L1-dcache-stores        avgt        590756.905            #/op

newArray:cycles                  avgt       1132753.204            #/op

newArray:instructions            avgt       3159465.006            #/op

newArray2                        avgt    4      125.531 ± 50.749  us/op

newArray2:CPI                    avgt             0.532            #/op

newArray2:L1-dcache-load-misses  avgt         10345.720            #/op

newArray2:L1-dcache-loads        avgt         85185.726            #/op

newArray2:L1-dcache-stores       avgt        103096.223            #/op

newArray2:cycles                 avgt        346651.432            #/op

newArray2:instructions           avgt        652155.439            #/op

edited 5 hours ago

answered 7 hours ago

Oleksandr Pyrohov

11.2k4 gold badges45 silver badges76 bronze badges

4

it would be nice if you elaborated on these results because I didn't get anything ;)

– Andrew Tobilko
7 hours ago

Thank you for the detailed data, Let me see clearly why.

– user10339780
2 hours ago

add a comment
|

From the Oracle Docs:

It may be more efficient to use newarray or anewarray when creating an array of a single dimension.

The newArray benchmark uses anewarray bytecode instruction, while newArray2 - multianewarray bytecode instruction, which makes up a difference in the end.

With the perf Linux profiler I got the following results.

For the newArray benchmark, the hottest regions are:

....[Hottest Methods (after inlining)]..............................................................

 22.58%           libjvm.so  MemAllocator::allocate

 14.80%           libjvm.so  ObjArrayAllocator::initialize

 12.92%           libjvm.so  TypeArrayKlass::multi_allocate

 10.98%           libjvm.so  AccessInternal::PostRuntimeDispatch<G1BarrierSet::AccessBarrier<2670710ul, G1BarrierSet>, (AccessInternal::BarrierType)1, 2670710ul>::oop_access_barrier

  7.38%           libjvm.so  ObjArrayKlass::multi_allocate

  6.02%           libjvm.so  MemAllocator::Allocation::notify_allocation_jvmti_sampler

  5.84%          ld-2.27.so  __tls_get_addr

  5.66%           libjvm.so  CollectedHeap::array_allocate

  5.39%           libjvm.so  Klass::check_array_allocation_length

  4.76%        libc-2.27.so  __memset_avx2_unaligned_erms

  0.75%        libc-2.27.so  __memset_avx2_erms

  0.38%           libjvm.so  __tls_get_addr@plt

  0.17%           libjvm.so  memset@plt

  0.10%           libjvm.so  G1ParScanThreadState::copy_to_survivor_space

  0.10%   [kernel.kallsyms]  update_blocked_averages

  0.06%   [kernel.kallsyms]  native_write_msr

  0.05%           libjvm.so  G1ParScanThreadState::trim_queue

  0.05%           libjvm.so  Monitor::lock_without_safepoint_check

  0.05%           libjvm.so  G1FreeCollectionSetTask::G1SerialFreeCollectionSetClosure::do_heap_region

  0.05%           libjvm.so  OtherRegionsTable::occupied

  1.92%  <...other 288 warm methods...>

For the newArray2:

....[Hottest Methods (after inlining)]..............................................................

 93.45%      perf-28023.map  [unknown]

  0.26%           libjvm.so  G1ParScanThreadState::copy_to_survivor_space

  0.22%   [kernel.kallsyms]  update_blocked_averages

  0.19%           libjvm.so  OtherRegionsTable::is_empty

  0.17%        libc-2.27.so  __memset_avx2_erms

  0.16%        libc-2.27.so  __memset_avx2_unaligned_erms

  0.14%           libjvm.so  OptoRuntime::new_array_C

  0.12%           libjvm.so  G1ParScanThreadState::trim_queue

  0.11%           libjvm.so  G1FreeCollectionSetTask::G1SerialFreeCollectionSetClosure::do_heap_region

  0.11%           libjvm.so  MemAllocator::allocate_inside_tlab_slow

  0.11%           libjvm.so  ObjArrayAllocator::initialize

  0.10%           libjvm.so  OtherRegionsTable::occupied

  0.10%           libjvm.so  MemAllocator::allocate

  0.10%           libjvm.so  Monitor::lock_without_safepoint_check

  0.10%   [kernel.kallsyms]  rt2800pci_rxdone_tasklet

  0.09%           libjvm.so  G1Allocator::unsafe_max_tlab_alloc

  0.08%           libjvm.so  ThreadLocalAllocBuffer::fill

  0.08%          ld-2.27.so  __tls_get_addr

  0.07%           libjvm.so  G1CollectedHeap::allocate_new_tlab

  0.07%           libjvm.so  TypeArrayKlass::allocate_common

  4.15%  <...other 411 warm methods...>

As we can see, for the slower newArray benchmark most of the time is spent in:

MemAllocator::allocate

ObjArrayAllocator::initialize

TypeArrayKlass::multi_allocate

ObjArrayKlass::multi_allocate

...

While the newArray2 benchmark uses the OptoRuntime::new_array_C, spending a lot less time allocating the memory for arrays.

As a result, note the difference in the number of cycles and instructions:

Benchmark                        Mode  Cnt        Score    Error  Units

newArray                         avgt    4      448.018 ± 80.029  us/op

newArray:CPI                     avgt             0.359            #/op

newArray:L1-dcache-load-misses   avgt         10399.712            #/op

newArray:L1-dcache-loads         avgt       1032985.924            #/op

newArray:L1-dcache-stores        avgt        590756.905            #/op

newArray:cycles                  avgt       1132753.204            #/op

newArray:instructions            avgt       3159465.006            #/op

newArray2                        avgt    4      125.531 ± 50.749  us/op

newArray2:CPI                    avgt             0.532            #/op

newArray2:L1-dcache-load-misses  avgt         10345.720            #/op

newArray2:L1-dcache-loads        avgt         85185.726            #/op

newArray2:L1-dcache-stores       avgt        103096.223            #/op

newArray2:cycles                 avgt        346651.432            #/op

newArray2:instructions           avgt        652155.439            #/op

edited 5 hours ago

answered 7 hours ago

Oleksandr Pyrohov

11.2k4 gold badges45 silver badges76 bronze badges

From the Oracle Docs:

It may be more efficient to use newarray or anewarray when creating an array of a single dimension.

The newArray benchmark uses anewarray bytecode instruction, while newArray2 - multianewarray bytecode instruction, which makes up a difference in the end.

With the perf Linux profiler I got the following results.

For the newArray benchmark, the hottest regions are:

....[Hottest Methods (after inlining)]..............................................................

 22.58%           libjvm.so  MemAllocator::allocate

 14.80%           libjvm.so  ObjArrayAllocator::initialize

 12.92%           libjvm.so  TypeArrayKlass::multi_allocate

 10.98%           libjvm.so  AccessInternal::PostRuntimeDispatch<G1BarrierSet::AccessBarrier<2670710ul, G1BarrierSet>, (AccessInternal::BarrierType)1, 2670710ul>::oop_access_barrier

  7.38%           libjvm.so  ObjArrayKlass::multi_allocate

  6.02%           libjvm.so  MemAllocator::Allocation::notify_allocation_jvmti_sampler

  5.84%          ld-2.27.so  __tls_get_addr

  5.66%           libjvm.so  CollectedHeap::array_allocate

  5.39%           libjvm.so  Klass::check_array_allocation_length

  4.76%        libc-2.27.so  __memset_avx2_unaligned_erms

  0.75%        libc-2.27.so  __memset_avx2_erms

  0.38%           libjvm.so  __tls_get_addr@plt

  0.17%           libjvm.so  memset@plt

  0.10%           libjvm.so  G1ParScanThreadState::copy_to_survivor_space

  0.10%   [kernel.kallsyms]  update_blocked_averages

  0.06%   [kernel.kallsyms]  native_write_msr

  0.05%           libjvm.so  G1ParScanThreadState::trim_queue

  0.05%           libjvm.so  Monitor::lock_without_safepoint_check

  0.05%           libjvm.so  G1FreeCollectionSetTask::G1SerialFreeCollectionSetClosure::do_heap_region

  0.05%           libjvm.so  OtherRegionsTable::occupied

  1.92%  <...other 288 warm methods...>

For the newArray2:

....[Hottest Methods (after inlining)]..............................................................

 93.45%      perf-28023.map  [unknown]

  0.26%           libjvm.so  G1ParScanThreadState::copy_to_survivor_space

  0.22%   [kernel.kallsyms]  update_blocked_averages

  0.19%           libjvm.so  OtherRegionsTable::is_empty

  0.17%        libc-2.27.so  __memset_avx2_erms

  0.16%        libc-2.27.so  __memset_avx2_unaligned_erms

  0.14%           libjvm.so  OptoRuntime::new_array_C

  0.12%           libjvm.so  G1ParScanThreadState::trim_queue

  0.11%           libjvm.so  G1FreeCollectionSetTask::G1SerialFreeCollectionSetClosure::do_heap_region

  0.11%           libjvm.so  MemAllocator::allocate_inside_tlab_slow

  0.11%           libjvm.so  ObjArrayAllocator::initialize

  0.10%           libjvm.so  OtherRegionsTable::occupied

  0.10%           libjvm.so  MemAllocator::allocate

  0.10%           libjvm.so  Monitor::lock_without_safepoint_check

  0.10%   [kernel.kallsyms]  rt2800pci_rxdone_tasklet

  0.09%           libjvm.so  G1Allocator::unsafe_max_tlab_alloc

  0.08%           libjvm.so  ThreadLocalAllocBuffer::fill

  0.08%          ld-2.27.so  __tls_get_addr

  0.07%           libjvm.so  G1CollectedHeap::allocate_new_tlab

  0.07%           libjvm.so  TypeArrayKlass::allocate_common

  4.15%  <...other 411 warm methods...>

As we can see, for the slower newArray benchmark most of the time is spent in:

MemAllocator::allocate

ObjArrayAllocator::initialize

TypeArrayKlass::multi_allocate

ObjArrayKlass::multi_allocate

...

While the newArray2 benchmark uses the OptoRuntime::new_array_C, spending a lot less time allocating the memory for arrays.

As a result, note the difference in the number of cycles and instructions:

Benchmark                        Mode  Cnt        Score    Error  Units

newArray                         avgt    4      448.018 ± 80.029  us/op

newArray:CPI                     avgt             0.359            #/op

newArray:L1-dcache-load-misses   avgt         10399.712            #/op

newArray:L1-dcache-loads         avgt       1032985.924            #/op

newArray:L1-dcache-stores        avgt        590756.905            #/op

newArray:cycles                  avgt       1132753.204            #/op

newArray:instructions            avgt       3159465.006            #/op

newArray2                        avgt    4      125.531 ± 50.749  us/op

newArray2:CPI                    avgt             0.532            #/op

newArray2:L1-dcache-load-misses  avgt         10345.720            #/op

newArray2:L1-dcache-loads        avgt         85185.726            #/op

newArray2:L1-dcache-stores       avgt        103096.223            #/op

newArray2:cycles                 avgt        346651.432            #/op

newArray2:instructions           avgt        652155.439            #/op

edited 5 hours ago

answered 7 hours ago

Oleksandr Pyrohov

11.2k4 gold badges45 silver badges76 bronze badges

edited 5 hours ago

answered 7 hours ago

Oleksandr Pyrohov

11.2k4 gold badges45 silver badges76 bronze badges

answered 7 hours ago

Oleksandr Pyrohov

11.2k4 gold badges45 silver badges76 bronze badges

answered 7 hours ago

Oleksandr Pyrohov

11.2k4 gold badges45 silver badges76 bronze badges

4

it would be nice if you elaborated on these results because I didn't get anything ;)

– Andrew Tobilko
7 hours ago

Thank you for the detailed data, Let me see clearly why.

– user10339780
2 hours ago

add a comment
|

4

it would be nice if you elaborated on these results because I didn't get anything ;)

– Andrew Tobilko
7 hours ago

Thank you for the detailed data, Let me see clearly why.

– user10339780
2 hours ago

it would be nice if you elaborated on these results because I didn't get anything ;)

– Andrew Tobilko
7 hours ago

Thank you for the detailed data, Let me see clearly why.

– user10339780
2 hours ago

add a comment
|

user10339780 is a new contributor. Be nice, and check out our Code of Conduct.

draft saved

draft discarded

user10339780 is a new contributor. Be nice, and check out our Code of Conduct.

Thanks for contributing an answer to Stack Overflow!

Please be sure to answer the question. Provide details and share your research!

But avoid …

Asking for help, clarification, or responding to other answers.

Making statements based on opinion; back them up with references or personal experience.

To learn more, see our tips on writing great answers.

draft saved

draft discarded

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

Post as a guest

Name

Required, but never shown

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

Name

Required, but never shown

Name

Required, but never shown

This page is only for reference, If you need detailed information, please check here

搜尋此網誌

Mdthbs