Which likelihood function is used in linear regression?
When trying to derive the maximum likelihood estimate for linear regression, we start from a likelihood function. Does it matter which of these two forms we use?
$P(y|x,w)$
$P(y,x|w)$
All the pages I have read on the internet use the first one.
I found that $P(y,x|w) = P(y|x,w)\,P(x)$,
so maximizing $P(y,x|w)$ with respect to $w$ is the same as maximizing $P(y|x,w)$, because $x$ and $w$ are independent (that is, $P(x)$ does not involve $w$).
The second form looks better to me because it reads as "the probability of the data ($x$ and $y$) given the parameter", whereas the first form doesn't show that.
Is my point correct, and is there any difference between the two?
regression mathematical-statistics maximum-likelihood
asked 10 hours ago by floyd, edited 9 hours ago
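For reference, the step the question hinges on can be written out explicitly: since the factor $P(x)$ does not involve $w$,
$$\arg\max_w P(y,x|w) = \arg\max_w \left[\, P(y|x,w)\,P(x) \,\right] = \arg\max_w P(y|x,w),$$
so both forms lead to the same estimate whenever the model for $x$ carries no information about $w$.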
2 Answers
As you say, in this case the choice doesn't matter; the maximization over $w$ is the same. But you touch on a good point.
Typically for regression problems we prefer to write the version conditioned on $x$, because we think of regression as modeling the conditional distribution $p(y|x)$ while ignoring what $p(x)$ may look like. This is because, for the model $Y = f(X) + \varepsilon$ with $\varepsilon \sim \mathcal{N}(0, \sigma^2)$, we want the estimated function $\hat{f}$ to be as close to $f$ as possible on average, as measured by, say, least squares. That is the same as pulling $\hat{f}$ close to $f$ at each input value individually: the distribution $p(x)$ describing how the inputs are scattered doesn't matter, as long as $\hat{f}$ is close to $f$ at those values. We only want to model how $y$ responds to $x$; we don't care how $x$ itself is distributed.
More mathematically, from an MLE standpoint the goal is to make the likelihood $P(x,y|w)$ as large as possible, but as you say, we are not modeling $P(x)$ through $w$, so $P(x)$ factors out and doesn't matter. From a fitting standpoint, if we want to minimize the expected prediction error over $f$,
$$\min_f \text{EPE}(f) = \min_f \mathbb{E}\,(Y - f(X))^2,$$
then, omitting some computation, the minimizing $f$ is
$$f(x) = \mathbb{E}(Y \,|\, X = x),$$
so for least-squares loss the best possible $f$ depends only on the conditional distribution $p(y|x)$, and estimating it should not require any additional information.
However, in classification problems (logistic regression, linear discriminant analysis, naive Bayes) there is a distinction between "generative" and "discriminative" models: generative models do not condition on $x$, while discriminative models do. For instance, in linear discriminant analysis we do model $P(x)$ through $w$ (and we also do not use least-squares loss there).
answered 8 hours ago by Drew N
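The equivalence described above is easy to check numerically. The following is a minimal sketch, not part of the original answer; the simulated data, the fixed noise scale, and all numeric values are invented for illustration. Because $p(x)$ never enters the objective, maximizing the Gaussian conditional likelihood $p(y|x,w)$ over $w$ lands on the least-squares solution.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)

# Simulated data (values invented for illustration): y = 2 + 3*x + Gaussian noise
n = 200
x = rng.normal(size=n)
y = 2.0 + 3.0 * x + rng.normal(scale=0.5, size=n)
X = np.column_stack([np.ones(n), x])   # design matrix with an intercept column

def neg_cond_loglik(w, sigma=0.5):
    """Negative Gaussian conditional log-likelihood -log p(y | x, w), up to an additive constant."""
    resid = y - X @ w
    return 0.5 * np.sum(resid ** 2) / sigma ** 2

# Ordinary least squares
w_ols, *_ = np.linalg.lstsq(X, y, rcond=None)

# Numerical maximum of the conditional likelihood over w
w_mle = minimize(neg_cond_loglik, x0=np.zeros(2)).x

print(w_ols)  # roughly [2, 3]
print(w_mle)  # matches the least-squares solution
```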
Thank you for the explanation. I saw some pages on the internet call $X$ "the model" and $y$ "the sample/data", and write the likelihood as $P(\text{data}|m, w)$ where $m$ stands for "model"; so they write the likelihood as $P(y|x,w)$. In my question I called both $x$ and $y$ "data". Am I right, or should I call $x$ "the model"? – floyd, 8 hours ago
I wouldn't call $X$ the model. When they write $m$ that way, they're conditioning on a probabilistic model for the data in addition to the data $x$; when they write just $x$ instead of $m$, the model is assumed implicitly. – Drew N, 7 hours ago
That's a good question, since the difference is a bit subtle; hopefully this helps.
The difference between calculating $p(y|x,w)$ and $p(y,x|w)$ lies in whether you model $X$ as fixed or random. In the second, you are looking at the joint probability of $X$ and $Y$ conditioned on $w$, whereas in the first you want the probability of $Y$ conditioned on $X$ and on the parameter $w$ (usually denoted $\beta$ in statistical settings).
Usually, in simple linear regression,
$$Y = \beta_0 + \beta_1 X + \epsilon,$$
you model $X$ as fixed and independent of $\epsilon$, which follows a $N(0,\sigma^2)$ distribution. That is, $Y$ is modeled as a linear function of some fixed input $X$ plus random noise $\epsilon$. It therefore makes sense to use $p(y|X,w)$ in this setting, since $X$ is fixed or treated as constant, which is usually written explicitly as $X=x$.
Mathematically, when $X$ is fixed, $p(X) = 1$ since it is constant. Therefore
$$p(y,X|w) = p(y|X,w)\,p(X) \implies p(y,X|w) = p(y|X,w).$$
If $X$ is not constant and given, then $p(X)$ is no longer 1 and the two expressions no longer coincide, so it all comes down to whether you model $X$ as random or fixed.
answered 9 hours ago by Samir Rachid Zaim
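The "$p(X)$ factors out" point can also be checked numerically. The following is a minimal sketch, not part of the original answer; the simulated data and the standard-normal model assumed for $X$ are invented for illustration. Because $\log p(x)$ does not depend on $w$, the joint and conditional log-likelihoods differ only by a constant and therefore peak at the same $w$.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)

# Simulated data (values invented for illustration): y = 1.5*x + standard-normal noise
n = 100
x = rng.normal(size=n)
y = 1.5 * x + rng.normal(size=n)

def cond_loglik(w):
    """Conditional log-likelihood log p(y | x, w) for a no-intercept Gaussian model."""
    return norm.logpdf(y, loc=w * x, scale=1.0).sum()

# A model for x that does not involve w (here: standard normal), so log p(x) is constant in w
log_px = norm.logpdf(x, loc=0.0, scale=1.0).sum()

def joint_loglik(w):
    """Joint log-likelihood log p(y, x | w) = log p(y | x, w) + log p(x)."""
    return cond_loglik(w) + log_px

# Grid search over w: both criteria are maximized at the same value
grid = np.linspace(0.0, 3.0, 3001)
w_cond = grid[np.argmax([cond_loglik(w) for w in grid])]
w_joint = grid[np.argmax([joint_loglik(w) for w in grid])]
print(w_cond, w_joint)  # identical
```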