How is the TD(0) method helpful? What good does it do?

What is the use of the TD(0) method in temporal difference learning?

The weights in temporal difference learning are updated according to the following equation (equation 4 of the linked paper):

[Image: equation 4]

When lambda = 0, as in TD(0), how does the method learn? It appears that with lambda = 0 there will never be a change in the weights, and hence no learning.

When can TD(0) be used? When is it helpful?

I cannot see any reason to use the TD(0) method. Am I missing anything?












  • TD(0) is better than MC in most cases. It might not appear so, but think of TD(0) (updating the value of a state based on the value of the next state) as gaining the experience of all the paths that went through that next state, whereas with the MC method you gain experience only from a single path.
    – DuttaA
    9 hours ago

  • I assume they use the convention 0^0 = 1, so the lambda^0 term is effectively 1.
    – Hanzy
    9 hours ago
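The point in the first comment can be illustrated with a small sketch. Everything here is hypothetical: the three-state chain, the value 0.75 for state B, the step size, and the function names `mc_update` and `td0_update` are made up for illustration. It only shows how the two update targets differ on a single new episode; it does not show that one method is better in general.

```python
def mc_update(value, state, ret, alpha):
    """Monte Carlo: move V(state) toward the return observed on this one path."""
    value[state] += alpha * (ret - value[state])


def td0_update(value, state, reward, next_state, alpha, gamma=1.0):
    """TD(0): move V(state) toward reward + gamma * V(next_state)."""
    target = reward + gamma * value[next_state]
    value[state] += alpha * (target - value[state])


# Hypothetical starting point: V(B) = 0.75 was learned from earlier episodes
# that passed through B.
v_mc = {"A": 0.0, "B": 0.75, "end": 0.0}
v_td = dict(v_mc)

# One new episode: A -> B -> end, with reward 0 on both steps,
# so the observed return from A is 0.
mc_update(v_mc, "A", ret=0.0, alpha=0.5)
td0_update(v_td, "A", reward=0.0, next_state="B", alpha=0.5)

print(v_mc["A"], v_td["A"])
```

In this sketch the Monte Carlo estimate of A only sees the single return of 0, while the TD(0) estimate of A is pulled toward what is already known about B.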


















reinforcement-learning temporal-difference notation






edited 9 hours ago by nbro
asked 10 hours ago by Amanda












1 Answer
> When lambda = 0 as in TD(0), how does the method learn? As it appears, with lambda = 0, there will never be a change in weight and hence no learning.

I think the detail you're missing is that one of the terms in the sum (the final term, the case where $k = t$) has $\lambda$ raised to the power $0$, and anything raised to the power $0$ (even $0$, under the usual convention $0^0 = 1$) is equal to $1$. So, for $\lambda = 0$, your update equation becomes

$$\Delta w_t = \alpha \left( P_{t+1} - P_t \right) \nabla_w P_t,$$

which is basically a one-step update (just like Sarsa).
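To make the collapse of the sum concrete, here is a minimal numerical sketch. It assumes the rule in the question has the TD($\lambda$) form from Sutton's 1988 paper, $\Delta w_t = \alpha \left( P_{t+1} - P_t \right) \sum_{k=1}^{t} \lambda^{t-k} \nabla_w P_k$. The function name `td_lambda_delta`, the gradients, and all numbers are made up for illustration.

```python
import numpy as np


def td_lambda_delta(alpha, lam, p_next, p_curr, grads):
    """Return alpha * (P_{t+1} - P_t) * sum_{k=1..t} lam**(t-k) * grad_k.

    grads[k - 1] holds the gradient of P_k with respect to the weights.
    """
    t = len(grads)
    # Python evaluates 0.0 ** 0 as 1.0, which matches the 0^0 = 1 convention,
    # so with lam = 0 only the k = t term of the sum survives.
    weighted_sum = sum(lam ** (t - k) * grads[k - 1] for k in range(1, t + 1))
    return alpha * (p_next - p_curr) * weighted_sum


# Hypothetical gradients for three time steps and two weights.
grads = [np.array([1.0, 0.0]), np.array([0.0, 1.0]), np.array([1.0, 1.0])]

delta_td0 = td_lambda_delta(alpha=0.1, lam=0.0, p_next=0.8, p_curr=0.5, grads=grads)

# One-step rule: alpha * (P_{t+1} - P_t) * grad_t
one_step = 0.1 * (0.8 - 0.5) * grads[-1]

print(delta_td0, one_step)
```

Under these assumptions the weight change for $\lambda = 0$ is non-zero and equals the one-step rule, so learning does take place.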













  • On page 16 of the same paper, Learning to Predict by the Methods of Temporal Differences (1988), Sutton actually states that $\Delta w_t = \alpha \left( P_{t+1} - P_t \right) \nabla_w P_t$ is the learning rule when $\lambda = 0$.
    – nbro
    7 hours ago

  • He starts with the supervised setting and then derives the Widrow-Hoff (or delta) rule. The TD rule is then a special case of the delta rule, where the errors $z - P_t$ are replaced with a sum of differences between successive predictions. However, how exactly is that specific 1-step TD learning rule related to the usual learning rules of (tabular) temporal difference methods, where apparently no gradient is needed?
    – nbro
    7 hours ago

  • @nbro You can view tabular methods as methods using linear function "approximation", where there is a single binary feature for every possible state-action pair. A gradient is then still needed, but it is simply $1$ for the "binary feature" corresponding to the state-action pair, and $0$ everywhere else.
    – Dennis Soemers
    7 hours ago
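The equivalence described in the last comment can be checked with a short sketch. This is only an illustration under stated assumptions: the table indexes four hypothetical entries (states or state-action pairs), the update uses the common form with a reward `r` and discount `gamma` (neither appears in the rule quoted above), and the transition and all numbers are made up.

```python
import numpy as np

n_entries = 4               # hypothetical number of states (or state-action pairs)
alpha, gamma = 0.1, 0.9     # assumed step size and discount factor
s, r, s_next = 1, 1.0, 2    # one made-up transition: entry 1 -> entry 2, reward 1


def one_hot(i, n):
    """Binary feature vector with a single 1 at position i."""
    x = np.zeros(n)
    x[i] = 1.0
    return x


initial = np.array([0.0, 0.5, 1.0, 0.0])

# Tabular TD(0): change only the table entry of the visited state.
tab = initial.copy()
tab[s] += alpha * (r + gamma * tab[s_next] - tab[s])

# Linear TD(0): the prediction is w . x(s), so its gradient w.r.t. w is x(s),
# which is 1 for the visited entry and 0 everywhere else.
w = initial.copy()
x, x_next = one_hot(s, n_entries), one_hot(s_next, n_entries)
td_error = r + gamma * (w @ x_next) - (w @ x)
w += alpha * td_error * x

print(tab, w)
```

With one-hot features the gradient step touches exactly one weight, which is why the tabular rule never needs to mention a gradient explicitly.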












answered 9 hours ago by Dennis Soemers











