How is TD(0) method helpful? What good does it do?


What is the use of the TD(0) method in temporal difference learning?

The weights in temporal difference learning are updated according to the following equation (equation 4 in the linked paper):

$$\Delta w_t = \alpha \left( P_{t+1} - P_t \right) \sum_{k=1}^{t} \lambda^{t-k} \nabla_w P_k$$

When $\lambda = 0$, as in TD(0), how does the method learn? It appears that with $\lambda = 0$ the weights never change, and hence nothing is learned.

When can TD(0) be used? When is it helpful?

I cannot see any reason to use the TD(0) method. Am I missing anything?

edited 9 hours ago

nbro

asked 10 hours ago

Amanda

TD(0) is better than MC in most cases. It might not appear so, but think of TD(0), which updates the value of a state based on the value of the next state, as gaining the experience of all the paths that went through that next state, whereas an MC method gains experience only from a single path.
– DuttaA
9 hours ago
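
One way to see the point of this comment is a small batch experiment, in the spirit of the A/B prediction example from Sutton and Barto's textbook. The sketch below is illustrative only: the episode data, the function names `batch_monte_carlo` and `batch_td0`, the step size, and the sweep count are all assumptions, and returns are undiscounted.

```python
# Eight hypothetical episodes, each a list of (state, reward) steps.
# Only the first episode visits A; all of them pass through B.
episodes = [[("A", 0.0), ("B", 0.0)]] + [[("B", 1.0)]] * 6 + [[("B", 0.0)]]


def batch_monte_carlo(episodes):
    """Average the observed (undiscounted) returns that follow each visit."""
    returns = {}
    for episode in episodes:
        for i, (state, _) in enumerate(episode):
            g = sum(reward for _, reward in episode[i:])
            returns.setdefault(state, []).append(g)
    return {state: sum(gs) / len(gs) for state, gs in returns.items()}


def batch_td0(episodes, alpha=0.05, sweeps=2000):
    """Repeat one-step TD updates over the fixed batch until they settle."""
    values = {state: 0.0 for episode in episodes for state, _ in episode}
    for _ in range(sweeps):
        increments = {state: 0.0 for state in values}
        for episode in episodes:
            for i, (state, reward) in enumerate(episode):
                # Bootstrap from the current estimate of the next state;
                # the value after the last step of an episode is 0.
                has_next = i + 1 < len(episode)
                next_value = values[episode[i + 1][0]] if has_next else 0.0
                td_error = reward + next_value - values[state]
                increments[state] += alpha * td_error
        for state in values:
            values[state] += increments[state]
    return values


mc_values = batch_monte_carlo(episodes)
td_values = batch_td0(episodes)
```

Only one episode visits A, and its return is 0, so the Monte Carlo estimate for A is 0. TD(0) instead bootstraps from B, whose estimate draws on all eight episodes, so its estimate for A approaches 0.75.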

I assume they use the convention $0^0 = 1$, so the $\lambda^0$ term is basically $1$.
– Hanzy
9 hours ago

reinforcement-learning temporal-difference notation


1 Answer

When lambda = 0 as in TD(0), how does the method learn? As it appears, with lambda = 0, there will never be a change in weight and hence no learning.

I think the detail that you're missing is that one of the terms in the sum (the final "iteration" of the sum, the case where $k = t$) has $\lambda$ raised to the power $0$, and anything raised to the power $0$ (even $0$, under the convention $0^0 = 1$) is equal to $1$. So, for $\lambda = 0$, your update equation becomes

$$\Delta w_t = \alpha \left( P_{t+1} - P_t \right) \nabla_w P_t,$$

which is basically a one-step update (just like Sarsa).
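
A minimal numeric sketch of that collapse, assuming the update has the form $\Delta w_t = \alpha (P_{t+1} - P_t) \sum_{k=1}^{t} \lambda^{t-k} \nabla_w P_k$. The function name `td_lambda_delta_w` and all the numbers are made up for illustration.

```python
def td_lambda_delta_w(alpha, lam, p_next, p_curr, grads):
    """Weight change at step t under the assumed TD(lambda) form.

    grads holds grad_w P_1 ... grad_w P_t, oldest first, as lists of floats.
    """
    t = len(grads)
    trace = [0.0] * len(grads[0])
    for k, grad in enumerate(grads, start=1):
        # Python evaluates 0.0 ** 0 as 1.0, matching the convention 0^0 = 1,
        # so with lam = 0 only the k = t term survives.
        decay = lam ** (t - k)
        for i, g in enumerate(grad):
            trace[i] += decay * g
    td_error = p_next - p_curr
    return [alpha * td_error * e for e in trace]


# Hypothetical gradients of three successive predictions.
grads = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

delta_td0 = td_lambda_delta_w(0.1, 0.0, p_next=0.8, p_curr=0.5, grads=grads)
delta_half = td_lambda_delta_w(0.1, 0.5, p_next=0.8, p_curr=0.5, grads=grads)

# The one-step rule uses only the most recent gradient.
one_step = [0.1 * (0.8 - 0.5) * g for g in grads[-1]]
```

With `lam=0.0` the result matches `one_step`, so the weights do change. With `lam=0.5` the earlier gradients also contribute, weighted by 0.25 and 0.5.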

answered 9 hours ago

Dennis Soemers

On page $16$ of the same paper, Learning to Predict by the Methods of Temporal Differences (1988), Sutton actually states that $\Delta w_t = \alpha \left( P_{t+1} - P_t \right) \nabla_w P_t$ is the learning rule when $\lambda = 0$.
– nbro
7 hours ago

He starts with the supervised setting and then derives the Widrow-Hoff (or delta) rule. The TD rule is then a special case of the delta rule, where the errors $z - P_t$ are replaced with a summation of the successive temporal-difference predictions. However, how exactly is that specific 1-step TD learning rule related to the usual learning rules of (tabular) temporal difference methods, where apparently no gradient is needed?
– nbro
7 hours ago

@nbro You can view tabular methods as methods using linear function "approximation", where there is a single binary feature for every possible state-action pair. A gradient is then still needed, but it is simply $1$ for the "binary feature" corresponding to the state-action pair, and $0$ everywhere else.
– Dennis Soemers
7 hours ago
