How to know the operations made to calculate the Levenshtein distance between strings?
With the function stringdist, I can calculate the Levenshtein distance between two strings: it counts the number of deletions, insertions and substitutions needed to turn one string into the other. For instance, stringdist("abc abc", "abcd abc", method = "lv") returns 1, because "d" was inserted in the second string.
Is it possible to know which operations were made to obtain the Levenshtein distance between two strings? Or else to know which characters differ between the two strings (in this example, only "d")?
Thanks.
library(stringdist)
stringdist("abc abc", "abcde acc", method = "lv")
# [1] 3
I would like to know that:
"d" was inserted
"e" was inserted
"b" was substituted by "c"
Or, more simply, I would like to have the list ("d", "e", "c").
r string levenshtein-distance stringdist
asked 8 hours ago by yaki (new contributor)
edited 7 hours ago by Konrad Rudolph
I don't know of any R package that does what you want. Do you really need it or are you asking for pedagogic purposes? In any case, see what Wikipedia has to say on this. And upvote.
– Rui Barradas
8 hours ago
It could help with my research, because I'm trying to find the differences between strings. Thanks for the link.
– yaki
8 hours ago
@RuiBarradas Not only do such packages exist, their existence is the major reason for R’s popularity today. :-)
– Konrad Rudolph
7 hours ago
3 Answers
With adist(), you can retrieve the number of operations of each type:
drop(attr(adist("abc abc", "abcde acc", counts = TRUE), "counts"))
# ins del sub
#   2   0   1
From ?adist:
If counts is TRUE, the transformation counts are returned as the
"counts" attribute of this matrix, as a 3-dimensional array with
dimensions corresponding to the elements of x, the elements of y, and
the type of transformation (insertions, deletions and substitutions),
respectively.
answered 8 hours ago by tmfmnk (edited 8 hours ago)
Thanks, it helps me a lot! Do you know if there is a function to directly get the characters corresponding to these operations? Otherwise, I could try to write a function using attr(adist("abda cc", "abc abc", counts = TRUE), "trafos"), which returns "MMSDMSIM", where M = match, S = substitute, D = delete, I = insert.
– yaki
8 hours ago
I don't know of any handy function that will do it. However, I assume that playing around with trafos will lead you to the desired results.
– tmfmnk
8 hours ago
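The M/S/D/I codes mentioned in the comment above can be decoded by walking the "trafos" string while keeping one position counter per input string. The following is only a minimal sketch: edit_ops is a hypothetical helper name, not part of base R, and it assumes two single strings and that adist() returns one optimal alignment (ties between equally short alignments are resolved by adist(), not by this code).

```r
# Hypothetical helper (not part of base R): list each edit operation together
# with the characters involved, by walking the "trafos" string from adist().
edit_ops <- function(x, y) {
  d <- adist(x, y, counts = TRUE)
  trafos <- strsplit(attr(d, "trafos")[1, 1], "")[[1]]
  xs <- strsplit(x, "")[[1]]
  ys <- strsplit(y, "")[[1]]
  i <- 0L  # current position in x
  j <- 0L  # current position in y
  ops <- character(0)
  for (op in trafos) {
    if (op == "M") {          # match: consumes one character of x and of y
      i <- i + 1L; j <- j + 1L
    } else if (op == "S") {   # substitution: consumes one character of each
      i <- i + 1L; j <- j + 1L
      ops <- c(ops, sprintf('"%s" was substituted by "%s"', xs[i], ys[j]))
    } else if (op == "I") {   # insertion: consumes a character of y only
      j <- j + 1L
      ops <- c(ops, sprintf('"%s" was inserted', ys[j]))
    } else if (op == "D") {   # deletion: consumes a character of x only
      i <- i + 1L
      ops <- c(ops, sprintf('"%s" was deleted', xs[i]))
    }
  }
  ops
}

edit_ops("abc abc", "abcde acc")
```

The function returns a character vector with one entry per non-match operation, in the order in which the operations occur along the alignment.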
Building off tmfmnk's answer and the suggestion to play around with the "trafos" attribute, here's a function which shows a table of all the characters inserted or substituted, and how many times each was inserted or substituted. If you set all_actions = TRUE, it also shows matches.
f <- function(x, y, all_actions = FALSE) {
  o <- adist(x, y, counts = TRUE)
  action <- strsplit(attr(o, "trafos"), "")[[1]]
  # Deletions ("D") consume no character of y, so drop them to keep
  # `char` and `action` the same length; deletions are not reported.
  action <- action[action != "D"]
  cva <- list(char = strsplit(y, "")[[1]], action = action)
  if (!all_actions)
    cva <- lapply(cva, `[`, cva$action %in% c("I", "S"))
  do.call(table, cva)
}
f(x = "abc abc", y = "abcde acc")
#     action
# char I S
#    c 0 1
#    d 1 0
#    e 1 0
f(x = "abc abc", y = "abcde acc", all_actions = TRUE)
#     action
# char I M S
#      0 1 0
#    a 0 2 0
#    b 0 1 0
#    c 0 2 1
#    d 1 0 0
#    e 1 0 0
answered 7 hours ago by IceCreamToucan (edited 5 hours ago)
This is known as the Needleman–Wunsch algorithm. It calculates both the distance between two strings and the so-called traceback, which allows you to reconstruct the alignment.
Since this problem mostly crops up in biology when comparing biological sequences, this algorithm (and related ones) is implemented in the R package {Biostrings}, which is part of Bioconductor.
Since this package implements a more general solution than the simple Levenshtein distance, the usage is unfortunately more complex, and the usage vignette is correspondingly long. But the fundamental usage for your purposes is as follows:
library(Biostrings)
# Scoring matrix over the allowed alphabet (lower-case letters and space):
# 1 on the diagonal (match), 0 everywhere else (mismatch).
dist_mat = diag(27L)
colnames(dist_mat) = rownames(dist_mat) = c(letters, ' ')
result = pairwiseAlignment(
    "abc abc", "abcde acc",
    substitutionMatrix = dist_mat,
    gapOpening = 1, gapExtension = 1
)
This won’t simply give you the list c('b', 'c', 'c'), though, because that list does not fully represent what actually happened here. Instead, it will return an alignment between the two strings. This can be represented as a sequence with substitutions and gaps:
score(result)
# [1] 3
aligned(result)
as.matrix(aligned(result))
#      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9]
# [1,] "a"  "b"  "c"  "-"  "-"  " "  "a"  "b"  "c"
aligned(result): for each character in the second string, it provides the corresponding character in the original string, replacing inserted characters by -. Basically, this is a “recipe” for transforming the first string into the second string. Note that it will only contain insertions and substitutions, not deletions. To get these, you need to perform the alignment the other way round (i.e. swapping the string arguments).
answered 6 hours ago by Konrad Rudolph
Unfortunately the code above requires you to specify dist_mat manually such that it contains one row and column for each character that your strings might contain. The code shown in this answer thus only allows lower-case letters and spaces, nothing else.
– Konrad Rudolph
6 hours ago
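To make the distance-plus-traceback idea from the answer above concrete without any alphabet restriction, here is a rough base-R sketch of the dynamic programme. It is not the Biostrings implementation: lev_traceback is a hypothetical name, unit costs for insertion, deletion and substitution are assumed, and only one of possibly several optimal alignments is returned (the traceback prefers the diagonal step). Gaps are written as "-", so input strings that themselves contain "-" will be ambiguous in the output.

```r
# Hypothetical helper: Levenshtein distance with traceback, unit costs assumed.
lev_traceback <- function(x, y) {
  xs <- strsplit(x, "")[[1]]
  ys <- strsplit(y, "")[[1]]
  n <- length(xs)
  m <- length(ys)
  # d[i + 1, j + 1] = distance between the first i chars of x and first j of y
  d <- matrix(0L, n + 1L, m + 1L)
  d[, 1] <- 0:n
  d[1, ] <- 0:m
  for (i in seq_len(n)) {
    for (j in seq_len(m)) {
      cost <- as.integer(xs[i] != ys[j])
      d[i + 1L, j + 1L] <- min(d[i, j] + cost,     # match or substitution
                               d[i, j + 1L] + 1L,  # deletion of xs[i]
                               d[i + 1L, j] + 1L)  # insertion of ys[j]
    }
  }
  # Traceback: walk from the bottom-right cell back to the top-left cell.
  i <- n
  j <- m
  ax <- character(0)
  ay <- character(0)
  while (i > 0L || j > 0L) {
    if (i > 0L && j > 0L &&
        d[i + 1L, j + 1L] == d[i, j] + as.integer(xs[i] != ys[j])) {
      ax <- c(xs[i], ax); ay <- c(ys[j], ay)       # diagonal step
      i <- i - 1L; j <- j - 1L
    } else if (j > 0L && d[i + 1L, j + 1L] == d[i + 1L, j] + 1L) {
      ax <- c("-", ax); ay <- c(ys[j], ay)         # insertion: gap in x
      j <- j - 1L
    } else {
      ax <- c(xs[i], ax); ay <- c("-", ay)         # deletion: gap in y
      i <- i - 1L
    }
  }
  list(distance = d[n + 1L, m + 1L], alignment = rbind(x = ax, y = ay))
}

res <- lev_traceback("abc abc", "abcde acc")
res$distance
res$alignment
```

Columns of the alignment where the two rows differ correspond to the edit operations: a "-" in row x is an insertion, a "-" in row y is a deletion, and two different characters are a substitution.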