Data standardization vs. normalization for clustering analysis
I'm performing clustering analysis and visualization (hierarchical clustering, PCA, t-SNE, etc.) on a dataset, and I'm a bit confused about how to prepare the data. I understand that the typical options are to standardize, normalize, or log-transform, but it seems there are no hard and fast rules for when to apply one over the other (a minimal sketch of what I mean by each option is below).
With standardization and log transformation, my dataset splits into two clusters under a number of different algorithms. One cluster is large and heterogeneous (which is actually interesting, as this is a biological problem and makes logical sense). However, if I normalize the data instead, I get three clusters: the heterogeneous cluster splits in two. This could make sense as well, but it would be a stretch, and the clusters are not as clean. What could be causing this? The non-heterogeneous cluster remains the same, which is reassuring. Is it reasonable to conclude that the "instability" of the second cluster is further evidence of heterogeneity in the dataset?
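For concreteness, this is roughly what I mean by each option (a minimal sketch with scikit-learn and NumPy; `X` stands in for my numeric feature matrix):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler

rng = np.random.default_rng(0)
X = rng.lognormal(mean=0.0, sigma=1.0, size=(100, 5))  # placeholder data

# Standardization: zero mean, unit variance for each feature (column)
X_std = StandardScaler().fit_transform(X)

# Normalization (min-max): rescale each feature to the [0, 1] range
X_norm = MinMaxScaler().fit_transform(X)

# Log transform: compress right-skewed features; log1p tolerates zeros
X_log = np.log1p(X)
```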
machine-learning clustering pca
asked 8 hours ago by Elicen (new contributor)
2 Answers
There cannot be a general rule on what to do.
Any automatic normalization is usually "wrong". Such transforms merely tend to work better than not weighting features at all, which is why people commonly use them, in particular on data they don't understand.
But the right way is to weight and scale features such that they have the right, balanced amount of influence on the results. Since there is no mathematical way to capture this "right balance" (it's not uniform!), there cannot be an automatic solution. You have to understand your data and scale each feature to give it the desired amount of influence.
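As a minimal sketch of what hand-chosen weighting can look like (the data and weights here are made up; in practice you would pick the weights from domain knowledge):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))  # placeholder data with three features

# Start from standardized features, then re-weight them by hand.
X_std = StandardScaler().fit_transform(X)

# Hypothetical weights from domain knowledge: feature 0 should matter
# twice as much as feature 1, and feature 2 half as much.
weights = np.array([2.0, 1.0, 0.5])
X_weighted = X_std * weights  # scales each feature's pull on distances

labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X_weighted)
```

Since k-means works with squared Euclidean distances, a weight of 2 gives that feature four times the influence on the objective.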
answered 8 hours ago by Anony-Mousse
Anony-Mousse provides a good answer. I'd add that often you are looking for sensible clusters that help the data tell a story. From what you've said in your question, the easier-to-interpret two-cluster solution would seem better. – zbicyclist, 7 hours ago
Thank you! I guess my question was also: what component of normalizing/standardizing would give rise to these differences in results? Also, the challenge with my dataset is that it is unlabelled (and impossible to label otherwise, which is common for biological data), so we are trying to use unsupervised clustering to figure out how many clusters exist. We are currently not weighting the features, but we do include feature selection. – Elicen, 7 hours ago
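One common heuristic for the "how many clusters exist" part is to compare silhouette scores across candidate values of k (a minimal sketch; the data below is a random placeholder, and silhouette analysis is only one of several model-selection options):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))  # placeholder for the preprocessed data

# A clear peak in the silhouette score suggests a natural cluster
# count; values near 0 indicate weak structure.
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    print(k, round(silhouette_score(X, labels), 3))
```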
Normalizing usually is much worse because of outliers. Standardization is much more robust. – Anony-Mousse, 6 hours ago
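A minimal numeric sketch of that point, using a toy feature with one extreme value:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 100.0])  # one extreme outlier

# Min-max normalization: the outlier defines the range, so the four
# inliers are squashed into [0, 0.031] and become nearly identical.
x_minmax = (x - x.min()) / (x.max() - x.min())
print(x_minmax)  # ≈ [0, 0.0101, 0.0202, 0.0303, 1]

# Standardization: the outlier inflates the standard deviation, but
# the inliers keep usable separation from one another.
x_std = (x - x.mean()) / x.std()
print(x_std)     # ≈ [-0.538, -0.513, -0.487, -0.461, 1.999]
```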
I think standard scaling mostly depends on the model being used, while normalizing depends on how the data originated.
Most distance-based models, e.g. k-means, need standard scaling so that features with large scales don't dominate the variation. The same goes for PCA.
Normalization, by contrast, mostly depends on the data. For example, if you have sensor data (each time step being a variable) with different scaling per reading, you may need to L2-normalize the data to bring the rows onto the same scale. Or if you are working on customer recommendation and your entries are the number of times each customer bought each item (items being variables), you might L2-normalize those counts if you don't want people who buy a lot to skew the features.
Personally, I think that if the variables are well defined, taking their log can cost you interpretability. So if you get good-looking clusters without the log transform, I'd stick with that.
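A minimal sketch of the purchase-count case (made-up numbers): after row-wise L2 normalization, a heavy buyer and a light buyer with the same item mix map to the same unit vector:

```python
import numpy as np
from sklearn.preprocessing import normalize

# Hypothetical purchase counts: rows are customers, columns are items.
counts = np.array([
    [30.0, 10.0, 0.0],  # heavy buyer
    [ 3.0,  1.0, 0.0],  # light buyer with the same item mix
    [ 0.0,  2.0, 2.0],
])

# Row-wise L2 normalization divides each row by its Euclidean norm,
# so overall purchase volume no longer dominates the comparison.
unit_rows = normalize(counts, norm="l2", axis=1)
print(unit_rows[0])  # ≈ [0.949, 0.316, 0]
print(unit_rows[1])  # same direction: ≈ [0.949, 0.316, 0]
```

With rows on the unit sphere, Euclidean distance between customers behaves like cosine distance, so clusters reflect item mix rather than purchase volume.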
answered 7 hours ago by aghd