How do I overfit?Classification probability thresholdRandom Forest can't overfit?Doesn't Factor Analysis...
Does a card have a keyword if it has the same effect as said keyword?
How can I support myself financially as a 17 year old with a loan?
Would glacier 'trees' be plausible?
Why isn't nylon as strong as kevlar?
Mic, cable, pre-amp setup for acoustic guitar to perform with big band through mic and guitar amp?
Out of scope work duties and resignation
Manager is threatening to grade me poorly if I don't complete the project
Would Hubble Space Telescope improve black hole image observed by EHT if it joined array of telesopes?
Can an isometry leave entropy invariant?
How does this change to the opportunity attack rule impact combat?
Why Isn’t SQL More Refactorable?
Why do only some White Walkers shatter into ice chips?
Can a nothic's Weird Insight action discover secrets about a player character that the character doesn't know about themselves?
How do I tell my manager that his code review comment is wrong?
Which module had more 'comfort' in terms of living space, the Lunar Module or the Command module?
What was the design of the Macintosh II's MMU replacement?
Did we get closer to another plane than we were supposed to, or was the pilot just protecting our delicate sensibilities?
Building a list of products from the elements in another list
Getting a W on your transcript for grad school applications
Why doesn't WotC use established keywords on all new cards?
Can you complete the sequence?
If your medical expenses exceed your income does the IRS pay you?
Where can I go to avoid planes overhead?
How did Kirk identify Gorgan in "And the Children Shall Lead"?
How do I overfit?
Classification probability thresholdRandom Forest can't overfit?Doesn't Factor Analysis always overfit on a theoretical basisDiscussion about overfit in xgboostWhy is boosting less likely to overfit?Decision tree does not overfit, why?Cannot overfit on the IRIS datasetNeural Network: Why can't I overfit?On which datasets does AdaBoost algorithm overfit?Why don't GAN generators vastly overfit?Can't get a Keras model to overfit
.everyoneloves__top-leaderboard:empty,.everyoneloves__mid-leaderboard:empty,.everyoneloves__bot-mid-leaderboard:empty{ margin-bottom:0;
}
$begingroup$
This is a weird question, I know.
I'm just a noob and trying to learn about different classifier options and how they work. So I'm asking the question:
Given a dataset of n1-dimensions and n2-observations where each observation can be classified into n3-buckets, which algorithm most efficiently (ideally with just one training iteration) produces a model (classification boundary) that would perfectly classify every observation in the dataset (completely overfit)?
In other words, how does one most easily overfit?
(Please don't lecture me on 'not overfitting'. This is just for theoretical educational purposes.)
I have a suspicion that the answer something like this: "Well, if your number of dimensions is greater than your number of observations, use X algorithm to draw the boundary(ies), otherwise use Y algorithm."
I also have a suspicion that the answer will say: "You can draw a smooth boundary, but that more computationally expensive than drawing straight lines between all differing classified observations."
But that's as far as my intuition will guide me. Can you help?
I have a hand-drawn example of what I think I'm talking about in 2D with binary classification.
Basically, just split the difference, right? What algorithm does this efficiently for n-dimensions?

overfitting
$endgroup$
add a comment |
$begingroup$
This is a weird question, I know.
I'm just a noob and trying to learn about different classifier options and how they work. So I'm asking the question:
Given a dataset of n1-dimensions and n2-observations where each observation can be classified into n3-buckets, which algorithm most efficiently (ideally with just one training iteration) produces a model (classification boundary) that would perfectly classify every observation in the dataset (completely overfit)?
In other words, how does one most easily overfit?
(Please don't lecture me on 'not overfitting'. This is just for theoretical educational purposes.)
I have a suspicion that the answer something like this: "Well, if your number of dimensions is greater than your number of observations, use X algorithm to draw the boundary(ies), otherwise use Y algorithm."
I also have a suspicion that the answer will say: "You can draw a smooth boundary, but that more computationally expensive than drawing straight lines between all differing classified observations."
But that's as far as my intuition will guide me. Can you help?
I have a hand-drawn example of what I think I'm talking about in 2D with binary classification.
Basically, just split the difference, right? What algorithm does this efficiently for n-dimensions?

overfitting
$endgroup$
3
$begingroup$
knn with $k=1$?
$endgroup$
– shimao
5 hours ago
$begingroup$
@shimao I suppose that would work, wouldn't it? Yeah, I can't see why not. Thank you very much!
$endgroup$
– Legit Stack
5 hours ago
$begingroup$
@shimao Is that the most efficient way to encode the Boundary? Probably right? Since we don't know if the data is entirely random or not, using the data itself as the encoded model with KNN algorithm is probably the best you can generally do. Right?
$endgroup$
– Legit Stack
5 hours ago
2
$begingroup$
@shimao: do you want to post your comment as an answer (perhaps with a little more detail)?
$endgroup$
– Stephan Kolassa
4 hours ago
add a comment |
$begingroup$
This is a weird question, I know.
I'm just a noob and trying to learn about different classifier options and how they work. So I'm asking the question:
Given a dataset of n1-dimensions and n2-observations where each observation can be classified into n3-buckets, which algorithm most efficiently (ideally with just one training iteration) produces a model (classification boundary) that would perfectly classify every observation in the dataset (completely overfit)?
In other words, how does one most easily overfit?
(Please don't lecture me on 'not overfitting'. This is just for theoretical educational purposes.)
I have a suspicion that the answer something like this: "Well, if your number of dimensions is greater than your number of observations, use X algorithm to draw the boundary(ies), otherwise use Y algorithm."
I also have a suspicion that the answer will say: "You can draw a smooth boundary, but that more computationally expensive than drawing straight lines between all differing classified observations."
But that's as far as my intuition will guide me. Can you help?
I have a hand-drawn example of what I think I'm talking about in 2D with binary classification.
Basically, just split the difference, right? What algorithm does this efficiently for n-dimensions?

overfitting
$endgroup$
This is a weird question, I know.
I'm just a noob and trying to learn about different classifier options and how they work. So I'm asking the question:
Given a dataset of n1-dimensions and n2-observations where each observation can be classified into n3-buckets, which algorithm most efficiently (ideally with just one training iteration) produces a model (classification boundary) that would perfectly classify every observation in the dataset (completely overfit)?
In other words, how does one most easily overfit?
(Please don't lecture me on 'not overfitting'. This is just for theoretical educational purposes.)
I have a suspicion that the answer something like this: "Well, if your number of dimensions is greater than your number of observations, use X algorithm to draw the boundary(ies), otherwise use Y algorithm."
I also have a suspicion that the answer will say: "You can draw a smooth boundary, but that more computationally expensive than drawing straight lines between all differing classified observations."
But that's as far as my intuition will guide me. Can you help?
I have a hand-drawn example of what I think I'm talking about in 2D with binary classification.
Basically, just split the difference, right? What algorithm does this efficiently for n-dimensions?

overfitting
overfitting
edited 50 mins ago
Peter Mortensen
20528
20528
asked 5 hours ago
Legit StackLegit Stack
1355
1355
3
$begingroup$
knn with $k=1$?
$endgroup$
– shimao
5 hours ago
$begingroup$
@shimao I suppose that would work, wouldn't it? Yeah, I can't see why not. Thank you very much!
$endgroup$
– Legit Stack
5 hours ago
$begingroup$
@shimao Is that the most efficient way to encode the Boundary? Probably right? Since we don't know if the data is entirely random or not, using the data itself as the encoded model with KNN algorithm is probably the best you can generally do. Right?
$endgroup$
– Legit Stack
5 hours ago
2
$begingroup$
@shimao: do you want to post your comment as an answer (perhaps with a little more detail)?
$endgroup$
– Stephan Kolassa
4 hours ago
add a comment |
3
$begingroup$
knn with $k=1$?
$endgroup$
– shimao
5 hours ago
$begingroup$
@shimao I suppose that would work, wouldn't it? Yeah, I can't see why not. Thank you very much!
$endgroup$
– Legit Stack
5 hours ago
$begingroup$
@shimao Is that the most efficient way to encode the Boundary? Probably right? Since we don't know if the data is entirely random or not, using the data itself as the encoded model with KNN algorithm is probably the best you can generally do. Right?
$endgroup$
– Legit Stack
5 hours ago
2
$begingroup$
@shimao: do you want to post your comment as an answer (perhaps with a little more detail)?
$endgroup$
– Stephan Kolassa
4 hours ago
3
3
$begingroup$
knn with $k=1$?
$endgroup$
– shimao
5 hours ago
$begingroup$
knn with $k=1$?
$endgroup$
– shimao
5 hours ago
$begingroup$
@shimao I suppose that would work, wouldn't it? Yeah, I can't see why not. Thank you very much!
$endgroup$
– Legit Stack
5 hours ago
$begingroup$
@shimao I suppose that would work, wouldn't it? Yeah, I can't see why not. Thank you very much!
$endgroup$
– Legit Stack
5 hours ago
$begingroup$
@shimao Is that the most efficient way to encode the Boundary? Probably right? Since we don't know if the data is entirely random or not, using the data itself as the encoded model with KNN algorithm is probably the best you can generally do. Right?
$endgroup$
– Legit Stack
5 hours ago
$begingroup$
@shimao Is that the most efficient way to encode the Boundary? Probably right? Since we don't know if the data is entirely random or not, using the data itself as the encoded model with KNN algorithm is probably the best you can generally do. Right?
$endgroup$
– Legit Stack
5 hours ago
2
2
$begingroup$
@shimao: do you want to post your comment as an answer (perhaps with a little more detail)?
$endgroup$
– Stephan Kolassa
4 hours ago
$begingroup$
@shimao: do you want to post your comment as an answer (perhaps with a little more detail)?
$endgroup$
– Stephan Kolassa
4 hours ago
add a comment |
2 Answers
2
active
oldest
votes
$begingroup$
As long as all the observations are unique, then K-nearest neighbors with K set to 1 and with any arbitrary valid distance metric will give a classifier which perfectly fits the training set (since the nearest neighbor of every point in the training set is trivially, itself). And it's probably the most efficient since no training at all is needed.
Is that the most efficient way to encode the Boundary? Probably right?
Since we don't know if the data is entirely random or not, using the
data itself as the encoded model with KNN algorithm is probably the
best you can generally do. Right?
It's the most time-efficient, but not necessarily the most space efficient.
$endgroup$
$begingroup$
If you want space efficiency, store the hashcodes. Also note that you only need to analyze N-1 buckets.
$endgroup$
– Mooing Duck
1 hour ago
add a comment |
$begingroup$
You can't.
At least not in general, to the degree you want, if you want a perfect fit with arbitrary data and arbitrary dimensionality.
As an example, suppose we have $n_1=0$ predictor dimensions (i.e., none at all) and $n_2=2$ observations classified into $n_3=2$ buckets. The two observations are classified into two different buckets, namely "chocolate" and "vanilla".
Since you don't have any predictors, you will not be able to classify them perfectly, period.
If you have at least one predictor that takes different values on each observation, then you can indeed overfit arbitrarily badly, simply by using arbitrarily high polynomial orders for a numerical predictor (if the predictor is categorical with different values on each observation, you don't even need to transform). The tool or model is pretty much secondary. Yes, it's easy to overfit.
Here is an example. The 10 observations are completely independent of the single numerical predictor. We fit increasingly complex logistical regressions or powers of the predictor and classify using a threshold of 0.5 (which is not good practice). Correctly fitted points are marked in green, incorrectly fitted ones in red.

R code:
nn <- 10
set.seed(2)
predictor <- runif(nn)
outcome <- runif(nn)>0.5
plot(predictor,outcome,pch=19,yaxt="n",ylim=c(-0.1,1.6))
axis(2,c(0,1),c("FALSE","TRUE"))
orders <- c(1,2,3,5,7,9)
xx <- seq(min(predictor),max(predictor),0.01)
par(mfrow=c(3,2))
for ( kk in seq_along(orders) ) {
plot(predictor,outcome,pch=19,yaxt="n",ylim=c(-0.2,1.2),main=paste("Order:",orders[kk]))
axis(2,c(0,1),c("FALSE","TRUE"))
model <- glm(outcome~poly(predictor,orders[kk]),family="binomial")
fits_obs <- predict(model,type="response")
fits <- predict(model,newdata=data.frame(predictor=xx),type="response")
lines(xx,fits)
correct <- (fits_obs>0.5 & outcome) | ( fits_obs<0.5 & !outcome)
points(predictor[correct],outcome[correct],cex=1.4,col="green",pch="o")
points(predictor[!correct],outcome[!correct],cex=1.4,col="red",pch="o")
}
$endgroup$
add a comment |
Your Answer
StackExchange.ready(function() {
var channelOptions = {
tags: "".split(" "),
id: "65"
};
initTagRenderer("".split(" "), "".split(" "), channelOptions);
StackExchange.using("externalEditor", function() {
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled) {
StackExchange.using("snippets", function() {
createEditor();
});
}
else {
createEditor();
}
});
function createEditor() {
StackExchange.prepareEditor({
heartbeatType: 'answer',
autoActivateHeartbeat: false,
convertImagesToLinks: false,
noModals: true,
showLowRepImageUploadWarning: true,
reputationToPostImages: null,
bindNavPrevention: true,
postfix: "",
imageUploader: {
brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
allowUrls: true
},
onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
});
}
});
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstats.stackexchange.com%2fquestions%2f406122%2fhow-do-i-overfit%23new-answer', 'question_page');
}
);
Post as a guest
Required, but never shown
2 Answers
2
active
oldest
votes
2 Answers
2
active
oldest
votes
active
oldest
votes
active
oldest
votes
$begingroup$
As long as all the observations are unique, then K-nearest neighbors with K set to 1 and with any arbitrary valid distance metric will give a classifier which perfectly fits the training set (since the nearest neighbor of every point in the training set is trivially, itself). And it's probably the most efficient since no training at all is needed.
Is that the most efficient way to encode the Boundary? Probably right?
Since we don't know if the data is entirely random or not, using the
data itself as the encoded model with KNN algorithm is probably the
best you can generally do. Right?
It's the most time-efficient, but not necessarily the most space efficient.
$endgroup$
$begingroup$
If you want space efficiency, store the hashcodes. Also note that you only need to analyze N-1 buckets.
$endgroup$
– Mooing Duck
1 hour ago
add a comment |
$begingroup$
As long as all the observations are unique, then K-nearest neighbors with K set to 1 and with any arbitrary valid distance metric will give a classifier which perfectly fits the training set (since the nearest neighbor of every point in the training set is trivially, itself). And it's probably the most efficient since no training at all is needed.
Is that the most efficient way to encode the Boundary? Probably right?
Since we don't know if the data is entirely random or not, using the
data itself as the encoded model with KNN algorithm is probably the
best you can generally do. Right?
It's the most time-efficient, but not necessarily the most space efficient.
$endgroup$
$begingroup$
If you want space efficiency, store the hashcodes. Also note that you only need to analyze N-1 buckets.
$endgroup$
– Mooing Duck
1 hour ago
add a comment |
$begingroup$
As long as all the observations are unique, then K-nearest neighbors with K set to 1 and with any arbitrary valid distance metric will give a classifier which perfectly fits the training set (since the nearest neighbor of every point in the training set is trivially, itself). And it's probably the most efficient since no training at all is needed.
Is that the most efficient way to encode the Boundary? Probably right?
Since we don't know if the data is entirely random or not, using the
data itself as the encoded model with KNN algorithm is probably the
best you can generally do. Right?
It's the most time-efficient, but not necessarily the most space efficient.
$endgroup$
As long as all the observations are unique, then K-nearest neighbors with K set to 1 and with any arbitrary valid distance metric will give a classifier which perfectly fits the training set (since the nearest neighbor of every point in the training set is trivially, itself). And it's probably the most efficient since no training at all is needed.
Is that the most efficient way to encode the Boundary? Probably right?
Since we don't know if the data is entirely random or not, using the
data itself as the encoded model with KNN algorithm is probably the
best you can generally do. Right?
It's the most time-efficient, but not necessarily the most space efficient.
answered 3 hours ago
shimaoshimao
10k11636
10k11636
$begingroup$
If you want space efficiency, store the hashcodes. Also note that you only need to analyze N-1 buckets.
$endgroup$
– Mooing Duck
1 hour ago
add a comment |
$begingroup$
If you want space efficiency, store the hashcodes. Also note that you only need to analyze N-1 buckets.
$endgroup$
– Mooing Duck
1 hour ago
$begingroup$
If you want space efficiency, store the hashcodes. Also note that you only need to analyze N-1 buckets.
$endgroup$
– Mooing Duck
1 hour ago
$begingroup$
If you want space efficiency, store the hashcodes. Also note that you only need to analyze N-1 buckets.
$endgroup$
– Mooing Duck
1 hour ago
add a comment |
$begingroup$
You can't.
At least not in general, to the degree you want, if you want a perfect fit with arbitrary data and arbitrary dimensionality.
As an example, suppose we have $n_1=0$ predictor dimensions (i.e., none at all) and $n_2=2$ observations classified into $n_3=2$ buckets. The two observations are classified into two different buckets, namely "chocolate" and "vanilla".
Since you don't have any predictors, you will not be able to classify them perfectly, period.
If you have at least one predictor that takes different values on each observation, then you can indeed overfit arbitrarily badly, simply by using arbitrarily high polynomial orders for a numerical predictor (if the predictor is categorical with different values on each observation, you don't even need to transform). The tool or model is pretty much secondary. Yes, it's easy to overfit.
Here is an example. The 10 observations are completely independent of the single numerical predictor. We fit increasingly complex logistical regressions or powers of the predictor and classify using a threshold of 0.5 (which is not good practice). Correctly fitted points are marked in green, incorrectly fitted ones in red.

R code:
nn <- 10
set.seed(2)
predictor <- runif(nn)
outcome <- runif(nn)>0.5
plot(predictor,outcome,pch=19,yaxt="n",ylim=c(-0.1,1.6))
axis(2,c(0,1),c("FALSE","TRUE"))
orders <- c(1,2,3,5,7,9)
xx <- seq(min(predictor),max(predictor),0.01)
par(mfrow=c(3,2))
for ( kk in seq_along(orders) ) {
plot(predictor,outcome,pch=19,yaxt="n",ylim=c(-0.2,1.2),main=paste("Order:",orders[kk]))
axis(2,c(0,1),c("FALSE","TRUE"))
model <- glm(outcome~poly(predictor,orders[kk]),family="binomial")
fits_obs <- predict(model,type="response")
fits <- predict(model,newdata=data.frame(predictor=xx),type="response")
lines(xx,fits)
correct <- (fits_obs>0.5 & outcome) | ( fits_obs<0.5 & !outcome)
points(predictor[correct],outcome[correct],cex=1.4,col="green",pch="o")
points(predictor[!correct],outcome[!correct],cex=1.4,col="red",pch="o")
}
$endgroup$
add a comment |
$begingroup$
You can't.
At least not in general, to the degree you want, if you want a perfect fit with arbitrary data and arbitrary dimensionality.
As an example, suppose we have $n_1=0$ predictor dimensions (i.e., none at all) and $n_2=2$ observations classified into $n_3=2$ buckets. The two observations are classified into two different buckets, namely "chocolate" and "vanilla".
Since you don't have any predictors, you will not be able to classify them perfectly, period.
If you have at least one predictor that takes different values on each observation, then you can indeed overfit arbitrarily badly, simply by using arbitrarily high polynomial orders for a numerical predictor (if the predictor is categorical with different values on each observation, you don't even need to transform). The tool or model is pretty much secondary. Yes, it's easy to overfit.
Here is an example. The 10 observations are completely independent of the single numerical predictor. We fit increasingly complex logistical regressions or powers of the predictor and classify using a threshold of 0.5 (which is not good practice). Correctly fitted points are marked in green, incorrectly fitted ones in red.

R code:
nn <- 10
set.seed(2)
predictor <- runif(nn)
outcome <- runif(nn)>0.5
plot(predictor,outcome,pch=19,yaxt="n",ylim=c(-0.1,1.6))
axis(2,c(0,1),c("FALSE","TRUE"))
orders <- c(1,2,3,5,7,9)
xx <- seq(min(predictor),max(predictor),0.01)
par(mfrow=c(3,2))
for ( kk in seq_along(orders) ) {
plot(predictor,outcome,pch=19,yaxt="n",ylim=c(-0.2,1.2),main=paste("Order:",orders[kk]))
axis(2,c(0,1),c("FALSE","TRUE"))
model <- glm(outcome~poly(predictor,orders[kk]),family="binomial")
fits_obs <- predict(model,type="response")
fits <- predict(model,newdata=data.frame(predictor=xx),type="response")
lines(xx,fits)
correct <- (fits_obs>0.5 & outcome) | ( fits_obs<0.5 & !outcome)
points(predictor[correct],outcome[correct],cex=1.4,col="green",pch="o")
points(predictor[!correct],outcome[!correct],cex=1.4,col="red",pch="o")
}
$endgroup$
add a comment |
$begingroup$
You can't.
At least not in general, to the degree you want, if you want a perfect fit with arbitrary data and arbitrary dimensionality.
As an example, suppose we have $n_1=0$ predictor dimensions (i.e., none at all) and $n_2=2$ observations classified into $n_3=2$ buckets. The two observations are classified into two different buckets, namely "chocolate" and "vanilla".
Since you don't have any predictors, you will not be able to classify them perfectly, period.
If you have at least one predictor that takes different values on each observation, then you can indeed overfit arbitrarily badly, simply by using arbitrarily high polynomial orders for a numerical predictor (if the predictor is categorical with different values on each observation, you don't even need to transform). The tool or model is pretty much secondary. Yes, it's easy to overfit.
Here is an example. The 10 observations are completely independent of the single numerical predictor. We fit increasingly complex logistical regressions or powers of the predictor and classify using a threshold of 0.5 (which is not good practice). Correctly fitted points are marked in green, incorrectly fitted ones in red.

R code:
nn <- 10
set.seed(2)
predictor <- runif(nn)
outcome <- runif(nn)>0.5
plot(predictor,outcome,pch=19,yaxt="n",ylim=c(-0.1,1.6))
axis(2,c(0,1),c("FALSE","TRUE"))
orders <- c(1,2,3,5,7,9)
xx <- seq(min(predictor),max(predictor),0.01)
par(mfrow=c(3,2))
for ( kk in seq_along(orders) ) {
plot(predictor,outcome,pch=19,yaxt="n",ylim=c(-0.2,1.2),main=paste("Order:",orders[kk]))
axis(2,c(0,1),c("FALSE","TRUE"))
model <- glm(outcome~poly(predictor,orders[kk]),family="binomial")
fits_obs <- predict(model,type="response")
fits <- predict(model,newdata=data.frame(predictor=xx),type="response")
lines(xx,fits)
correct <- (fits_obs>0.5 & outcome) | ( fits_obs<0.5 & !outcome)
points(predictor[correct],outcome[correct],cex=1.4,col="green",pch="o")
points(predictor[!correct],outcome[!correct],cex=1.4,col="red",pch="o")
}
$endgroup$
You can't.
At least not in general, to the degree you want, if you want a perfect fit with arbitrary data and arbitrary dimensionality.
As an example, suppose we have $n_1=0$ predictor dimensions (i.e., none at all) and $n_2=2$ observations classified into $n_3=2$ buckets. The two observations are classified into two different buckets, namely "chocolate" and "vanilla".
Since you don't have any predictors, you will not be able to classify them perfectly, period.
If you have at least one predictor that takes different values on each observation, then you can indeed overfit arbitrarily badly, simply by using arbitrarily high polynomial orders for a numerical predictor (if the predictor is categorical with different values on each observation, you don't even need to transform). The tool or model is pretty much secondary. Yes, it's easy to overfit.
Here is an example. The 10 observations are completely independent of the single numerical predictor. We fit increasingly complex logistical regressions or powers of the predictor and classify using a threshold of 0.5 (which is not good practice). Correctly fitted points are marked in green, incorrectly fitted ones in red.

R code:
nn <- 10
set.seed(2)
predictor <- runif(nn)
outcome <- runif(nn)>0.5
plot(predictor,outcome,pch=19,yaxt="n",ylim=c(-0.1,1.6))
axis(2,c(0,1),c("FALSE","TRUE"))
orders <- c(1,2,3,5,7,9)
xx <- seq(min(predictor),max(predictor),0.01)
par(mfrow=c(3,2))
for ( kk in seq_along(orders) ) {
plot(predictor,outcome,pch=19,yaxt="n",ylim=c(-0.2,1.2),main=paste("Order:",orders[kk]))
axis(2,c(0,1),c("FALSE","TRUE"))
model <- glm(outcome~poly(predictor,orders[kk]),family="binomial")
fits_obs <- predict(model,type="response")
fits <- predict(model,newdata=data.frame(predictor=xx),type="response")
lines(xx,fits)
correct <- (fits_obs>0.5 & outcome) | ( fits_obs<0.5 & !outcome)
points(predictor[correct],outcome[correct],cex=1.4,col="green",pch="o")
points(predictor[!correct],outcome[!correct],cex=1.4,col="red",pch="o")
}
answered 4 hours ago
Stephan KolassaStephan Kolassa
49.1k8103186
49.1k8103186
add a comment |
add a comment |
Thanks for contributing an answer to Cross Validated!
- Please be sure to answer the question. Provide details and share your research!
But avoid …
- Asking for help, clarification, or responding to other answers.
- Making statements based on opinion; back them up with references or personal experience.
Use MathJax to format equations. MathJax reference.
To learn more, see our tips on writing great answers.
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstats.stackexchange.com%2fquestions%2f406122%2fhow-do-i-overfit%23new-answer', 'question_page');
}
);
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
3
$begingroup$
knn with $k=1$?
$endgroup$
– shimao
5 hours ago
$begingroup$
@shimao I suppose that would work, wouldn't it? Yeah, I can't see why not. Thank you very much!
$endgroup$
– Legit Stack
5 hours ago
$begingroup$
@shimao Is that the most efficient way to encode the Boundary? Probably right? Since we don't know if the data is entirely random or not, using the data itself as the encoded model with KNN algorithm is probably the best you can generally do. Right?
$endgroup$
– Legit Stack
5 hours ago
2
$begingroup$
@shimao: do you want to post your comment as an answer (perhaps with a little more detail)?
$endgroup$
– Stephan Kolassa
4 hours ago