How can I visualize an ordinal variable predicting a continuous outcome?Treating ordinal variables as...
Are devices supposed to automatically be removed from iCloud when all content and settings are erased?
Why isn't there armor to protect from spells in the Potterverse?
Impersonating user with Core Service App and Angular client
Phonetic distortion when words are borrowed among languages
Smallest PRIME containing the first 11 primes as sub-strings
Why does `FindFit` fail so badly in this simple case?
Is it possible for a company to grow but its stock price stays the same or decrease?
Windows 10 deletes lots of tiny files super slowly. Anything that can be done to speed it up?
Would an object shot from earth fall into the sun?
Knights and Knaves: What does C say?
How important is knowledge of trig identities for use in Calculus
What would happen if I build a half bath without permits?
Top off gas with old oil, is that bad?
Can adverbs modify adjectives?
How is the Apple Watch ECG disabled in certain countries?
GPLv3 forces us to make code available, but to who?
Lost passport and visa, tried to reapply, got rejected twice. What are my next steps?
Is there any site with telescopes data?
Science fiction episode about the creation of a living pegasus, even though flightless
How to identify whether a publisher is genuine or not?
How do we know neutrons have no charge?
Why does it seem the best way to make a living is to invest in real estate?
Create the same subfolders in another folder
Convert a string of digits from words to an integer
How can I visualize an ordinal variable predicting a continuous outcome?
Treating ordinal variables as continuous for regression problemsHow to visualize (make plot) of regression output against categorical input variable?plot predicted values from a cumulative link model (clm, ordinal)Ordinal vs. Continuous Variable and Appropriate Method for Testing Difference of GroupsR Mediate: How to interpret output with an ordinal outcome?continuous independent variable with three levels
.everyoneloves__top-leaderboard:empty,.everyoneloves__mid-leaderboard:empty,.everyoneloves__bot-mid-leaderboard:empty{ margin-bottom:0;
}
$begingroup$
What is the best way to visualize the relationship between an ordinal predictor and a continuous outcome?
So far I have the below, but I feel like this is lacking...
The way I modeled it is I treated the ordinal predictor as an interval instead of categorical. If this is not the best way to treat this type of data, I'd appreciate the feedback.
data-visualization ordinal-data
$endgroup$
add a comment
|
$begingroup$
What is the best way to visualize the relationship between an ordinal predictor and a continuous outcome?
So far I have the below, but I feel like this is lacking...
The way I modeled it is I treated the ordinal predictor as an interval instead of categorical. If this is not the best way to treat this type of data, I'd appreciate the feedback.
data-visualization ordinal-data
$endgroup$
4
$begingroup$
1. Is there any particular reason to imagine the relationship would be linear? 2. Is there indeed any need to draw any kind of curve or line? Why not simply mark in the means (or any other suitable measure of location) in each category? 3. Can you say more about the continuous outcome? What kind of thing is it?
$endgroup$
– Glen_b♦
17 hours ago
add a comment
|
$begingroup$
What is the best way to visualize the relationship between an ordinal predictor and a continuous outcome?
So far I have the below, but I feel like this is lacking...
The way I modeled it is I treated the ordinal predictor as an interval instead of categorical. If this is not the best way to treat this type of data, I'd appreciate the feedback.
data-visualization ordinal-data
$endgroup$
What is the best way to visualize the relationship between an ordinal predictor and a continuous outcome?
So far I have the below, but I feel like this is lacking...
The way I modeled it is I treated the ordinal predictor as an interval instead of categorical. If this is not the best way to treat this type of data, I'd appreciate the feedback.
data-visualization ordinal-data
data-visualization ordinal-data
edited 36 mins ago
Peter Mortensen
2052 silver badges8 bronze badges
2052 silver badges8 bronze badges
asked 19 hours ago
762762
527 bronze badges
527 bronze badges
4
$begingroup$
1. Is there any particular reason to imagine the relationship would be linear? 2. Is there indeed any need to draw any kind of curve or line? Why not simply mark in the means (or any other suitable measure of location) in each category? 3. Can you say more about the continuous outcome? What kind of thing is it?
$endgroup$
– Glen_b♦
17 hours ago
add a comment
|
4
$begingroup$
1. Is there any particular reason to imagine the relationship would be linear? 2. Is there indeed any need to draw any kind of curve or line? Why not simply mark in the means (or any other suitable measure of location) in each category? 3. Can you say more about the continuous outcome? What kind of thing is it?
$endgroup$
– Glen_b♦
17 hours ago
4
4
$begingroup$
1. Is there any particular reason to imagine the relationship would be linear? 2. Is there indeed any need to draw any kind of curve or line? Why not simply mark in the means (or any other suitable measure of location) in each category? 3. Can you say more about the continuous outcome? What kind of thing is it?
$endgroup$
– Glen_b♦
17 hours ago
$begingroup$
1. Is there any particular reason to imagine the relationship would be linear? 2. Is there indeed any need to draw any kind of curve or line? Why not simply mark in the means (or any other suitable measure of location) in each category? 3. Can you say more about the continuous outcome? What kind of thing is it?
$endgroup$
– Glen_b♦
17 hours ago
add a comment
|
6 Answers
6
active
oldest
votes
$begingroup$
The problem with this is that there's no way of knowing how many dots are bunched up together. Two solutions I've seen:
Box plot
This would give you tighter box if data points are bunched up together.
Bubble chart
Not sure if this is the official name, but basically you put the vertical axis into bins. The size of the bubble is determined by how many observations fall into that bin.
$endgroup$
$begingroup$
How does the bubble chart help to display the ordinal variable? Maybe you can change your example image to one with ordinal data.
$endgroup$
– Pieter
15 hours ago
$begingroup$
@Pieter Thanks for the suggestion but I couldn't find one... but you can see the variable on the x-axis are all integers (11, 12, 13, 14, ...) so that's one example of discrete data, and you can treat ordinal values as discrete data.
$endgroup$
– Art
15 hours ago
add a comment
|
$begingroup$
The plot you shown is pretty good. But I think you can improve the data-ink ratio (invented by Edward Tufte) even more by showing all the datapoints. You can do this by adding jitter to the x-axis.
Another improvement is to emphasise that the ordinal variable is categorical and not continuous. You can do this by using a different colour for the different levels.
As an example I have plotted the titanic dataset in R, using the passenger class as an ordinal variable and the passenger age as the continuous variable.
library(tidyverse)
library(ggplot2)
library(titanic)
df <- titanic_train %>% mutate(Class=factor(Pclass))
ggplot(df, aes(Class, Age, color=Class)) +
geom_jitter(height = 0) +
ggtitle("Titanic passenger age vs. class")
$endgroup$
add a comment
|
$begingroup$
To your scatterplot, I would add a large point indicating the mean Y-value at every unique X-value, and also do one or more of the following:
- Square-root (or cube-root) transform your Y-axis. Both these
transformations can deal with zeroes, unlike log transformations.
Cube roots can also deal with negative numbers. - Make the points a bit transparent.
- Add a little jitter to the X-axis values if the previous steps are insufficient.
As Glen_b notes, there is insufficient information right now to note whether adding a linear regression line is meaningful.
$endgroup$
$begingroup$
Are there any zeros in the response?
$endgroup$
– Nick Cox
17 hours ago
$begingroup$
@NickCox Hard to say for sure. I squinted at it and it seemed like there might be. See X = 3, for example.
$endgroup$
– mkt
17 hours ago
$begingroup$
Question remains for OP, who should know. Incidentally I am a small fan of cube roots, used in Miles, Stokes, Vieli, Cox in Nature. We had to work hard to persuade reviewers that they were a good idea for a response that was variously positive and negative. But for the question here I would favour a Poisson model, which can work fine for non-negative continuous responses.
$endgroup$
– Nick Cox
16 hours ago
$begingroup$
@NickCox Agreed, but the question was about how to visualise, not how to model.
$endgroup$
– mkt
16 hours ago
1
$begingroup$
Indeed. That was a comment not an answer. But a Poisson model would imply plotting on log scale with a secondary question on how to plot observed zeros.
$endgroup$
– Nick Cox
15 hours ago
add a comment
|
$begingroup$
You state that one variable is ordinal, then you decide to treat it as interval. Is that reasonable? There is no way for us to know, as you have not said what the ordinal variable actually is. If you do decide to keep it as ordinal, then what to do depends on your sample size. If N is very large then I like the box plot solution. If N is not so large, then I like jitter. There are other additions you can make to the scatterplot as well - I wrote a presentation about this using SAS, but I am sure it could be duplicated in R. (If that link does not work, Googling flom, scatterplots, enhancements should find it).
But what if treating the variable as interval is not reasonable? You could come to this conclusion either substantively or by trying different codings and seeing how results change. In that case, I suggest trying optimal scaling. There is an R package optiscale that may help (I have not used this package).
$endgroup$
add a comment
|
$begingroup$
The basic idea of regression is that the probability distribution of $y$ depends on $x$: there is some family of distributions $P_x(y)$. It's generally assumed that these distributions are all normal with a constant standard deviation (homoscedasticity), leaving only the mean as depending on $x$: $p(Y=y) = N(mu_x,sigma)$. With continuous data, you typically get only one $y$ for finitely many $x$, and no $y$ for the rest, making estimating $mu_x$ by just looking at your sample $y$ for that $x$ unworkable. So a further assumption is often made that $u_x$ is a simple linear function of $x$, so that $p(Y=y) = N(mx+b,sigma)$ for some numbers $m, b, sigma$. The linear regression formula then gives you an estimate of $m$ (slope) and $b$ (intercept) for your data.
Here, you seem to have highly skewed data, and there seems to be a general trend of decreasing spread, so if you were to use linear regression, the normality and homoscedasticity assumptions would be problematic. But you appear to have a large dataset for each value of $x$. So to estimate $mu$ for a particular $x$, there is no need to use the linear regression formula; you can simply take $bar y$ for each $x$. Which is more informative for predicting a $y$ for $x=4$: looking at the $y$ values for $x=4$, or looking at the $y$ values for $x=3$ and $x=5$, and trying interpolate between them?
You may want to show summary statistics other than just $bar y_x$. A box plot can show meadian and quartiles, for instance. You might also want to represent the standard deviation somehow.
You could also show the entire distributions. You could do that with x-dither, as Pieter suggested, or with another type of chart, such as density plots. You could put them side-by-side as in Pieter's answer, but with only six categories, it might be possible to combine them into one chart with the categories separated by colors. Here's a discussion of histograms and density plots: https://towardsdatascience.com/histograms-and-density-plots-in-python-f6bda88f5ac0
$endgroup$
add a comment
|
$begingroup$
In addition to the box plot suggested by Art, I suggest a violin plot:
Explicitly showing the median and interquartile range, as done in the above image, is optional.
Quoting from Wikipedia:
Violin plots are similar to box plots, except that they also show the probability density of the data at different values, usually smoothed by a kernel density estimator.
A violin plot is more informative than a plain box plot. While a box plot only shows summary statistics such as mean/median and interquartile ranges, the violin plot shows the full distribution of the data. The difference is particularly useful when the data distribution is multimodal (more than one peak).
$endgroup$
add a comment
|
Your Answer
StackExchange.ready(function() {
var channelOptions = {
tags: "".split(" "),
id: "65"
};
initTagRenderer("".split(" "), "".split(" "), channelOptions);
StackExchange.using("externalEditor", function() {
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled) {
StackExchange.using("snippets", function() {
createEditor();
});
}
else {
createEditor();
}
});
function createEditor() {
StackExchange.prepareEditor({
heartbeatType: 'answer',
autoActivateHeartbeat: false,
convertImagesToLinks: false,
noModals: true,
showLowRepImageUploadWarning: true,
reputationToPostImages: null,
bindNavPrevention: true,
postfix: "",
imageUploader: {
brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/4.0/"u003ecc by-sa 4.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
allowUrls: true
},
onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
});
}
});
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstats.stackexchange.com%2fquestions%2f428445%2fhow-can-i-visualize-an-ordinal-variable-predicting-a-continuous-outcome%23new-answer', 'question_page');
}
);
Post as a guest
Required, but never shown
6 Answers
6
active
oldest
votes
6 Answers
6
active
oldest
votes
active
oldest
votes
active
oldest
votes
$begingroup$
The problem with this is that there's no way of knowing how many dots are bunched up together. Two solutions I've seen:
Box plot
This would give you tighter box if data points are bunched up together.
Bubble chart
Not sure if this is the official name, but basically you put the vertical axis into bins. The size of the bubble is determined by how many observations fall into that bin.
$endgroup$
$begingroup$
How does the bubble chart help to display the ordinal variable? Maybe you can change your example image to one with ordinal data.
$endgroup$
– Pieter
15 hours ago
$begingroup$
@Pieter Thanks for the suggestion but I couldn't find one... but you can see the variable on the x-axis are all integers (11, 12, 13, 14, ...) so that's one example of discrete data, and you can treat ordinal values as discrete data.
$endgroup$
– Art
15 hours ago
add a comment
|
$begingroup$
The problem with this is that there's no way of knowing how many dots are bunched up together. Two solutions I've seen:
Box plot
This would give you tighter box if data points are bunched up together.
Bubble chart
Not sure if this is the official name, but basically you put the vertical axis into bins. The size of the bubble is determined by how many observations fall into that bin.
$endgroup$
$begingroup$
How does the bubble chart help to display the ordinal variable? Maybe you can change your example image to one with ordinal data.
$endgroup$
– Pieter
15 hours ago
$begingroup$
@Pieter Thanks for the suggestion but I couldn't find one... but you can see the variable on the x-axis are all integers (11, 12, 13, 14, ...) so that's one example of discrete data, and you can treat ordinal values as discrete data.
$endgroup$
– Art
15 hours ago
add a comment
|
$begingroup$
The problem with this is that there's no way of knowing how many dots are bunched up together. Two solutions I've seen:
Box plot
This would give you tighter box if data points are bunched up together.
Bubble chart
Not sure if this is the official name, but basically you put the vertical axis into bins. The size of the bubble is determined by how many observations fall into that bin.
$endgroup$
The problem with this is that there's no way of knowing how many dots are bunched up together. Two solutions I've seen:
Box plot
This would give you tighter box if data points are bunched up together.
Bubble chart
Not sure if this is the official name, but basically you put the vertical axis into bins. The size of the bubble is determined by how many observations fall into that bin.
answered 18 hours ago
ArtArt
3351 silver badge7 bronze badges
3351 silver badge7 bronze badges
$begingroup$
How does the bubble chart help to display the ordinal variable? Maybe you can change your example image to one with ordinal data.
$endgroup$
– Pieter
15 hours ago
$begingroup$
@Pieter Thanks for the suggestion but I couldn't find one... but you can see the variable on the x-axis are all integers (11, 12, 13, 14, ...) so that's one example of discrete data, and you can treat ordinal values as discrete data.
$endgroup$
– Art
15 hours ago
add a comment
|
$begingroup$
How does the bubble chart help to display the ordinal variable? Maybe you can change your example image to one with ordinal data.
$endgroup$
– Pieter
15 hours ago
$begingroup$
@Pieter Thanks for the suggestion but I couldn't find one... but you can see the variable on the x-axis are all integers (11, 12, 13, 14, ...) so that's one example of discrete data, and you can treat ordinal values as discrete data.
$endgroup$
– Art
15 hours ago
$begingroup$
How does the bubble chart help to display the ordinal variable? Maybe you can change your example image to one with ordinal data.
$endgroup$
– Pieter
15 hours ago
$begingroup$
How does the bubble chart help to display the ordinal variable? Maybe you can change your example image to one with ordinal data.
$endgroup$
– Pieter
15 hours ago
$begingroup$
@Pieter Thanks for the suggestion but I couldn't find one... but you can see the variable on the x-axis are all integers (11, 12, 13, 14, ...) so that's one example of discrete data, and you can treat ordinal values as discrete data.
$endgroup$
– Art
15 hours ago
$begingroup$
@Pieter Thanks for the suggestion but I couldn't find one... but you can see the variable on the x-axis are all integers (11, 12, 13, 14, ...) so that's one example of discrete data, and you can treat ordinal values as discrete data.
$endgroup$
– Art
15 hours ago
add a comment
|
$begingroup$
The plot you shown is pretty good. But I think you can improve the data-ink ratio (invented by Edward Tufte) even more by showing all the datapoints. You can do this by adding jitter to the x-axis.
Another improvement is to emphasise that the ordinal variable is categorical and not continuous. You can do this by using a different colour for the different levels.
As an example I have plotted the titanic dataset in R, using the passenger class as an ordinal variable and the passenger age as the continuous variable.
library(tidyverse)
library(ggplot2)
library(titanic)
df <- titanic_train %>% mutate(Class=factor(Pclass))
ggplot(df, aes(Class, Age, color=Class)) +
geom_jitter(height = 0) +
ggtitle("Titanic passenger age vs. class")
$endgroup$
add a comment
|
$begingroup$
The plot you shown is pretty good. But I think you can improve the data-ink ratio (invented by Edward Tufte) even more by showing all the datapoints. You can do this by adding jitter to the x-axis.
Another improvement is to emphasise that the ordinal variable is categorical and not continuous. You can do this by using a different colour for the different levels.
As an example I have plotted the titanic dataset in R, using the passenger class as an ordinal variable and the passenger age as the continuous variable.
library(tidyverse)
library(ggplot2)
library(titanic)
df <- titanic_train %>% mutate(Class=factor(Pclass))
ggplot(df, aes(Class, Age, color=Class)) +
geom_jitter(height = 0) +
ggtitle("Titanic passenger age vs. class")
$endgroup$
add a comment
|
$begingroup$
The plot you shown is pretty good. But I think you can improve the data-ink ratio (invented by Edward Tufte) even more by showing all the datapoints. You can do this by adding jitter to the x-axis.
Another improvement is to emphasise that the ordinal variable is categorical and not continuous. You can do this by using a different colour for the different levels.
As an example I have plotted the titanic dataset in R, using the passenger class as an ordinal variable and the passenger age as the continuous variable.
library(tidyverse)
library(ggplot2)
library(titanic)
df <- titanic_train %>% mutate(Class=factor(Pclass))
ggplot(df, aes(Class, Age, color=Class)) +
geom_jitter(height = 0) +
ggtitle("Titanic passenger age vs. class")
$endgroup$
The plot you shown is pretty good. But I think you can improve the data-ink ratio (invented by Edward Tufte) even more by showing all the datapoints. You can do this by adding jitter to the x-axis.
Another improvement is to emphasise that the ordinal variable is categorical and not continuous. You can do this by using a different colour for the different levels.
As an example I have plotted the titanic dataset in R, using the passenger class as an ordinal variable and the passenger age as the continuous variable.
library(tidyverse)
library(ggplot2)
library(titanic)
df <- titanic_train %>% mutate(Class=factor(Pclass))
ggplot(df, aes(Class, Age, color=Class)) +
geom_jitter(height = 0) +
ggtitle("Titanic passenger age vs. class")
answered 15 hours ago
PieterPieter
1,4976 silver badges19 bronze badges
1,4976 silver badges19 bronze badges
add a comment
|
add a comment
|
$begingroup$
To your scatterplot, I would add a large point indicating the mean Y-value at every unique X-value, and also do one or more of the following:
- Square-root (or cube-root) transform your Y-axis. Both these
transformations can deal with zeroes, unlike log transformations.
Cube roots can also deal with negative numbers. - Make the points a bit transparent.
- Add a little jitter to the X-axis values if the previous steps are insufficient.
As Glen_b notes, there is insufficient information right now to note whether adding a linear regression line is meaningful.
$endgroup$
$begingroup$
Are there any zeros in the response?
$endgroup$
– Nick Cox
17 hours ago
$begingroup$
@NickCox Hard to say for sure. I squinted at it and it seemed like there might be. See X = 3, for example.
$endgroup$
– mkt
17 hours ago
$begingroup$
Question remains for OP, who should know. Incidentally I am a small fan of cube roots, used in Miles, Stokes, Vieli, Cox in Nature. We had to work hard to persuade reviewers that they were a good idea for a response that was variously positive and negative. But for the question here I would favour a Poisson model, which can work fine for non-negative continuous responses.
$endgroup$
– Nick Cox
16 hours ago
$begingroup$
@NickCox Agreed, but the question was about how to visualise, not how to model.
$endgroup$
– mkt
16 hours ago
1
$begingroup$
Indeed. That was a comment not an answer. But a Poisson model would imply plotting on log scale with a secondary question on how to plot observed zeros.
$endgroup$
– Nick Cox
15 hours ago
add a comment
|
$begingroup$
To your scatterplot, I would add a large point indicating the mean Y-value at every unique X-value, and also do one or more of the following:
- Square-root (or cube-root) transform your Y-axis. Both these
transformations can deal with zeroes, unlike log transformations.
Cube roots can also deal with negative numbers. - Make the points a bit transparent.
- Add a little jitter to the X-axis values if the previous steps are insufficient.
As Glen_b notes, there is insufficient information right now to note whether adding a linear regression line is meaningful.
$endgroup$
$begingroup$
Are there any zeros in the response?
$endgroup$
– Nick Cox
17 hours ago
$begingroup$
@NickCox Hard to say for sure. I squinted at it and it seemed like there might be. See X = 3, for example.
$endgroup$
– mkt
17 hours ago
$begingroup$
Question remains for OP, who should know. Incidentally I am a small fan of cube roots, used in Miles, Stokes, Vieli, Cox in Nature. We had to work hard to persuade reviewers that they were a good idea for a response that was variously positive and negative. But for the question here I would favour a Poisson model, which can work fine for non-negative continuous responses.
$endgroup$
– Nick Cox
16 hours ago
$begingroup$
@NickCox Agreed, but the question was about how to visualise, not how to model.
$endgroup$
– mkt
16 hours ago
1
$begingroup$
Indeed. That was a comment not an answer. But a Poisson model would imply plotting on log scale with a secondary question on how to plot observed zeros.
$endgroup$
– Nick Cox
15 hours ago
add a comment
|
$begingroup$
To your scatterplot, I would add a large point indicating the mean Y-value at every unique X-value, and also do one or more of the following:
- Square-root (or cube-root) transform your Y-axis. Both these
transformations can deal with zeroes, unlike log transformations.
Cube roots can also deal with negative numbers. - Make the points a bit transparent.
- Add a little jitter to the X-axis values if the previous steps are insufficient.
As Glen_b notes, there is insufficient information right now to note whether adding a linear regression line is meaningful.
$endgroup$
To your scatterplot, I would add a large point indicating the mean Y-value at every unique X-value, and also do one or more of the following:
- Square-root (or cube-root) transform your Y-axis. Both these
transformations can deal with zeroes, unlike log transformations.
Cube roots can also deal with negative numbers. - Make the points a bit transparent.
- Add a little jitter to the X-axis values if the previous steps are insufficient.
As Glen_b notes, there is insufficient information right now to note whether adding a linear regression line is meaningful.
edited 17 hours ago
answered 18 hours ago
mktmkt
7,4536 gold badges31 silver badges89 bronze badges
7,4536 gold badges31 silver badges89 bronze badges
$begingroup$
Are there any zeros in the response?
$endgroup$
– Nick Cox
17 hours ago
$begingroup$
@NickCox Hard to say for sure. I squinted at it and it seemed like there might be. See X = 3, for example.
$endgroup$
– mkt
17 hours ago
$begingroup$
Question remains for OP, who should know. Incidentally I am a small fan of cube roots, used in Miles, Stokes, Vieli, Cox in Nature. We had to work hard to persuade reviewers that they were a good idea for a response that was variously positive and negative. But for the question here I would favour a Poisson model, which can work fine for non-negative continuous responses.
$endgroup$
– Nick Cox
16 hours ago
$begingroup$
@NickCox Agreed, but the question was about how to visualise, not how to model.
$endgroup$
– mkt
16 hours ago
1
$begingroup$
Indeed. That was a comment not an answer. But a Poisson model would imply plotting on log scale with a secondary question on how to plot observed zeros.
$endgroup$
– Nick Cox
15 hours ago
add a comment
|
$begingroup$
Are there any zeros in the response?
$endgroup$
– Nick Cox
17 hours ago
$begingroup$
@NickCox Hard to say for sure. I squinted at it and it seemed like there might be. See X = 3, for example.
$endgroup$
– mkt
17 hours ago
$begingroup$
Question remains for OP, who should know. Incidentally I am a small fan of cube roots, used in Miles, Stokes, Vieli, Cox in Nature. We had to work hard to persuade reviewers that they were a good idea for a response that was variously positive and negative. But for the question here I would favour a Poisson model, which can work fine for non-negative continuous responses.
$endgroup$
– Nick Cox
16 hours ago
$begingroup$
@NickCox Agreed, but the question was about how to visualise, not how to model.
$endgroup$
– mkt
16 hours ago
1
$begingroup$
Indeed. That was a comment not an answer. But a Poisson model would imply plotting on log scale with a secondary question on how to plot observed zeros.
$endgroup$
– Nick Cox
15 hours ago
$begingroup$
Are there any zeros in the response?
$endgroup$
– Nick Cox
17 hours ago
$begingroup$
Are there any zeros in the response?
$endgroup$
– Nick Cox
17 hours ago
$begingroup$
@NickCox Hard to say for sure. I squinted at it and it seemed like there might be. See X = 3, for example.
$endgroup$
– mkt
17 hours ago
$begingroup$
@NickCox Hard to say for sure. I squinted at it and it seemed like there might be. See X = 3, for example.
$endgroup$
– mkt
17 hours ago
$begingroup$
Question remains for OP, who should know. Incidentally I am a small fan of cube roots, used in Miles, Stokes, Vieli, Cox in Nature. We had to work hard to persuade reviewers that they were a good idea for a response that was variously positive and negative. But for the question here I would favour a Poisson model, which can work fine for non-negative continuous responses.
$endgroup$
– Nick Cox
16 hours ago
$begingroup$
Question remains for OP, who should know. Incidentally I am a small fan of cube roots, used in Miles, Stokes, Vieli, Cox in Nature. We had to work hard to persuade reviewers that they were a good idea for a response that was variously positive and negative. But for the question here I would favour a Poisson model, which can work fine for non-negative continuous responses.
$endgroup$
– Nick Cox
16 hours ago
$begingroup$
@NickCox Agreed, but the question was about how to visualise, not how to model.
$endgroup$
– mkt
16 hours ago
$begingroup$
@NickCox Agreed, but the question was about how to visualise, not how to model.
$endgroup$
– mkt
16 hours ago
1
1
$begingroup$
Indeed. That was a comment not an answer. But a Poisson model would imply plotting on log scale with a secondary question on how to plot observed zeros.
$endgroup$
– Nick Cox
15 hours ago
$begingroup$
Indeed. That was a comment not an answer. But a Poisson model would imply plotting on log scale with a secondary question on how to plot observed zeros.
$endgroup$
– Nick Cox
15 hours ago
add a comment
|
$begingroup$
You state that one variable is ordinal, then you decide to treat it as interval. Is that reasonable? There is no way for us to know, as you have not said what the ordinal variable actually is. If you do decide to keep it as ordinal, then what to do depends on your sample size. If N is very large then I like the box plot solution. If N is not so large, then I like jitter. There are other additions you can make to the scatterplot as well - I wrote a presentation about this using SAS, but I am sure it could be duplicated in R. (If that link does not work, Googling flom, scatterplots, enhancements should find it).
But what if treating the variable as interval is not reasonable? You could come to this conclusion either substantively or by trying different codings and seeing how results change. In that case, I suggest trying optimal scaling. There is an R package optiscale that may help (I have not used this package).
$endgroup$
add a comment
|
$begingroup$
You state that one variable is ordinal, then you decide to treat it as interval. Is that reasonable? There is no way for us to know, as you have not said what the ordinal variable actually is. If you do decide to keep it as ordinal, then what to do depends on your sample size. If N is very large then I like the box plot solution. If N is not so large, then I like jitter. There are other additions you can make to the scatterplot as well - I wrote a presentation about this using SAS, but I am sure it could be duplicated in R. (If that link does not work, Googling flom, scatterplots, enhancements should find it).
But what if treating the variable as interval is not reasonable? You could come to this conclusion either substantively or by trying different codings and seeing how results change. In that case, I suggest trying optimal scaling. There is an R package optiscale that may help (I have not used this package).
$endgroup$
add a comment
|
$begingroup$
You state that one variable is ordinal, then you decide to treat it as interval. Is that reasonable? There is no way for us to know, as you have not said what the ordinal variable actually is. If you do decide to keep it as ordinal, then what to do depends on your sample size. If N is very large then I like the box plot solution. If N is not so large, then I like jitter. There are other additions you can make to the scatterplot as well - I wrote a presentation about this using SAS, but I am sure it could be duplicated in R. (If that link does not work, Googling flom, scatterplots, enhancements should find it).
But what if treating the variable as interval is not reasonable? You could come to this conclusion either substantively or by trying different codings and seeing how results change. In that case, I suggest trying optimal scaling. There is an R package optiscale that may help (I have not used this package).
$endgroup$
You state that one variable is ordinal, then you decide to treat it as interval. Is that reasonable? There is no way for us to know, as you have not said what the ordinal variable actually is. If you do decide to keep it as ordinal, then what to do depends on your sample size. If N is very large then I like the box plot solution. If N is not so large, then I like jitter. There are other additions you can make to the scatterplot as well - I wrote a presentation about this using SAS, but I am sure it could be duplicated in R. (If that link does not work, Googling flom, scatterplots, enhancements should find it).
But what if treating the variable as interval is not reasonable? You could come to this conclusion either substantively or by trying different codings and seeing how results change. In that case, I suggest trying optimal scaling. There is an R package optiscale that may help (I have not used this package).
answered 15 hours ago
Peter Flom♦Peter Flom
80.7k13 gold badges117 silver badges228 bronze badges
80.7k13 gold badges117 silver badges228 bronze badges
add a comment
|
add a comment
|
$begingroup$
The basic idea of regression is that the probability distribution of $y$ depends on $x$: there is some family of distributions $P_x(y)$. It's generally assumed that these distributions are all normal with a constant standard deviation (homoscedasticity), leaving only the mean as depending on $x$: $p(Y=y) = N(mu_x,sigma)$. With continuous data, you typically get only one $y$ for finitely many $x$, and no $y$ for the rest, making estimating $mu_x$ by just looking at your sample $y$ for that $x$ unworkable. So a further assumption is often made that $u_x$ is a simple linear function of $x$, so that $p(Y=y) = N(mx+b,sigma)$ for some numbers $m, b, sigma$. The linear regression formula then gives you an estimate of $m$ (slope) and $b$ (intercept) for your data.
Here, you seem to have highly skewed data, and there seems to be a general trend of decreasing spread, so if you were to use linear regression, the normality and homoscedasticity assumptions would be problematic. But you appear to have a large dataset for each value of $x$. So to estimate $mu$ for a particular $x$, there is no need to use the linear regression formula; you can simply take $bar y$ for each $x$. Which is more informative for predicting a $y$ for $x=4$: looking at the $y$ values for $x=4$, or looking at the $y$ values for $x=3$ and $x=5$, and trying interpolate between them?
You may want to show summary statistics other than just $bar y_x$. A box plot can show meadian and quartiles, for instance. You might also want to represent the standard deviation somehow.
You could also show the entire distributions. You could do that with x-dither, as Pieter suggested, or with another type of chart, such as density plots. You could put them side-by-side as in Pieter's answer, but with only six categories, it might be possible to combine them into one chart with the categories separated by colors. Here's a discussion of histograms and density plots: https://towardsdatascience.com/histograms-and-density-plots-in-python-f6bda88f5ac0
$endgroup$
add a comment
|
$begingroup$
The basic idea of regression is that the probability distribution of $y$ depends on $x$: there is some family of distributions $P_x(y)$. It's generally assumed that these distributions are all normal with a constant standard deviation (homoscedasticity), leaving only the mean as depending on $x$: $p(Y=y) = N(mu_x,sigma)$. With continuous data, you typically get only one $y$ for finitely many $x$, and no $y$ for the rest, making estimating $mu_x$ by just looking at your sample $y$ for that $x$ unworkable. So a further assumption is often made that $u_x$ is a simple linear function of $x$, so that $p(Y=y) = N(mx+b,sigma)$ for some numbers $m, b, sigma$. The linear regression formula then gives you an estimate of $m$ (slope) and $b$ (intercept) for your data.
Here, you seem to have highly skewed data, and there seems to be a general trend of decreasing spread, so if you were to use linear regression, the normality and homoscedasticity assumptions would be problematic. But you appear to have a large dataset for each value of $x$. So to estimate $mu$ for a particular $x$, there is no need to use the linear regression formula; you can simply take $bar y$ for each $x$. Which is more informative for predicting a $y$ for $x=4$: looking at the $y$ values for $x=4$, or looking at the $y$ values for $x=3$ and $x=5$, and trying interpolate between them?
You may want to show summary statistics other than just $bar y_x$. A box plot can show meadian and quartiles, for instance. You might also want to represent the standard deviation somehow.
You could also show the entire distributions. You could do that with x-dither, as Pieter suggested, or with another type of chart, such as density plots. You could put them side-by-side as in Pieter's answer, but with only six categories, it might be possible to combine them into one chart with the categories separated by colors. Here's a discussion of histograms and density plots: https://towardsdatascience.com/histograms-and-density-plots-in-python-f6bda88f5ac0
$endgroup$
add a comment
|
$begingroup$
The basic idea of regression is that the probability distribution of $y$ depends on $x$: there is some family of distributions $P_x(y)$. It's generally assumed that these distributions are all normal with a constant standard deviation (homoscedasticity), leaving only the mean as depending on $x$: $p(Y=y) = N(mu_x,sigma)$. With continuous data, you typically get only one $y$ for finitely many $x$, and no $y$ for the rest, making estimating $mu_x$ by just looking at your sample $y$ for that $x$ unworkable. So a further assumption is often made that $u_x$ is a simple linear function of $x$, so that $p(Y=y) = N(mx+b,sigma)$ for some numbers $m, b, sigma$. The linear regression formula then gives you an estimate of $m$ (slope) and $b$ (intercept) for your data.
Here, you seem to have highly skewed data, and there seems to be a general trend of decreasing spread, so if you were to use linear regression, the normality and homoscedasticity assumptions would be problematic. But you appear to have a large dataset for each value of $x$. So to estimate $mu$ for a particular $x$, there is no need to use the linear regression formula; you can simply take $bar y$ for each $x$. Which is more informative for predicting a $y$ for $x=4$: looking at the $y$ values for $x=4$, or looking at the $y$ values for $x=3$ and $x=5$, and trying interpolate between them?
You may want to show summary statistics other than just $bar y_x$. A box plot can show meadian and quartiles, for instance. You might also want to represent the standard deviation somehow.
You could also show the entire distributions. You could do that with x-dither, as Pieter suggested, or with another type of chart, such as density plots. You could put them side-by-side as in Pieter's answer, but with only six categories, it might be possible to combine them into one chart with the categories separated by colors. Here's a discussion of histograms and density plots: https://towardsdatascience.com/histograms-and-density-plots-in-python-f6bda88f5ac0
$endgroup$
The basic idea of regression is that the probability distribution of $y$ depends on $x$: there is some family of distributions $P_x(y)$. It's generally assumed that these distributions are all normal with a constant standard deviation (homoscedasticity), leaving only the mean as depending on $x$: $p(Y=y) = N(mu_x,sigma)$. With continuous data, you typically get only one $y$ for finitely many $x$, and no $y$ for the rest, making estimating $mu_x$ by just looking at your sample $y$ for that $x$ unworkable. So a further assumption is often made that $u_x$ is a simple linear function of $x$, so that $p(Y=y) = N(mx+b,sigma)$ for some numbers $m, b, sigma$. The linear regression formula then gives you an estimate of $m$ (slope) and $b$ (intercept) for your data.
Here, you seem to have highly skewed data, and there seems to be a general trend of decreasing spread, so if you were to use linear regression, the normality and homoscedasticity assumptions would be problematic. But you appear to have a large dataset for each value of $x$. So to estimate $mu$ for a particular $x$, there is no need to use the linear regression formula; you can simply take $bar y$ for each $x$. Which is more informative for predicting a $y$ for $x=4$: looking at the $y$ values for $x=4$, or looking at the $y$ values for $x=3$ and $x=5$, and trying interpolate between them?
You may want to show summary statistics other than just $bar y_x$. A box plot can show meadian and quartiles, for instance. You might also want to represent the standard deviation somehow.
You could also show the entire distributions. You could do that with x-dither, as Pieter suggested, or with another type of chart, such as density plots. You could put them side-by-side as in Pieter's answer, but with only six categories, it might be possible to combine them into one chart with the categories separated by colors. Here's a discussion of histograms and density plots: https://towardsdatascience.com/histograms-and-density-plots-in-python-f6bda88f5ac0
answered 3 hours ago
AcccumulationAcccumulation
1,8612 silver badges7 bronze badges
1,8612 silver badges7 bronze badges
add a comment
|
add a comment
|
$begingroup$
In addition to the box plot suggested by Art, I suggest a violin plot:
Explicitly showing the median and interquartile range, as done in the above image, is optional.
Quoting from Wikipedia:
Violin plots are similar to box plots, except that they also show the probability density of the data at different values, usually smoothed by a kernel density estimator.
A violin plot is more informative than a plain box plot. While a box plot only shows summary statistics such as mean/median and interquartile ranges, the violin plot shows the full distribution of the data. The difference is particularly useful when the data distribution is multimodal (more than one peak).
$endgroup$
add a comment
|
$begingroup$
In addition to the box plot suggested by Art, I suggest a violin plot:
Explicitly showing the median and interquartile range, as done in the above image, is optional.
Quoting from Wikipedia:
Violin plots are similar to box plots, except that they also show the probability density of the data at different values, usually smoothed by a kernel density estimator.
A violin plot is more informative than a plain box plot. While a box plot only shows summary statistics such as mean/median and interquartile ranges, the violin plot shows the full distribution of the data. The difference is particularly useful when the data distribution is multimodal (more than one peak).
$endgroup$
add a comment
|
$begingroup$
In addition to the box plot suggested by Art, I suggest a violin plot:
Explicitly showing the median and interquartile range, as done in the above image, is optional.
Quoting from Wikipedia:
Violin plots are similar to box plots, except that they also show the probability density of the data at different values, usually smoothed by a kernel density estimator.
A violin plot is more informative than a plain box plot. While a box plot only shows summary statistics such as mean/median and interquartile ranges, the violin plot shows the full distribution of the data. The difference is particularly useful when the data distribution is multimodal (more than one peak).
$endgroup$
In addition to the box plot suggested by Art, I suggest a violin plot:
Explicitly showing the median and interquartile range, as done in the above image, is optional.
Quoting from Wikipedia:
Violin plots are similar to box plots, except that they also show the probability density of the data at different values, usually smoothed by a kernel density estimator.
A violin plot is more informative than a plain box plot. While a box plot only shows summary statistics such as mean/median and interquartile ranges, the violin plot shows the full distribution of the data. The difference is particularly useful when the data distribution is multimodal (more than one peak).
answered 1 hour ago
user76284user76284
12014 bronze badges
12014 bronze badges
add a comment
|
add a comment
|
Thanks for contributing an answer to Cross Validated!
- Please be sure to answer the question. Provide details and share your research!
But avoid …
- Asking for help, clarification, or responding to other answers.
- Making statements based on opinion; back them up with references or personal experience.
Use MathJax to format equations. MathJax reference.
To learn more, see our tips on writing great answers.
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstats.stackexchange.com%2fquestions%2f428445%2fhow-can-i-visualize-an-ordinal-variable-predicting-a-continuous-outcome%23new-answer', 'question_page');
}
);
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
4
$begingroup$
1. Is there any particular reason to imagine the relationship would be linear? 2. Is there indeed any need to draw any kind of curve or line? Why not simply mark in the means (or any other suitable measure of location) in each category? 3. Can you say more about the continuous outcome? What kind of thing is it?
$endgroup$
– Glen_b♦
17 hours ago