How to know the operations made to calculate the Levenshtein distance between strings?


With the function stringdist, I can calculate the Levenshtein distance between strings: it counts the number of deletions, insertions and substitutions necessary to turn one string into another. For instance, stringdist("abc abc", "abcd abc") returns 1 because "d" was inserted in the second string.

Is it possible to know the operations made to obtain the Levenshtein distance between two strings? Or else to know the characters that differ between the two strings (in this example, only "d")?
Thanks.

library(stringdist)
stringdist("abc abc", "abcde acc")
# [1] 3

I would like to know that:

  • "d" was inserted
  • "e" was inserted
  • "b" was substituted with "c"

Or, more simply, I would like to have the list ("d", "e", "c").







r string levenshtein-distance stringdist






edited 7 hours ago by Konrad Rudolph

asked 8 hours ago by yaki










  • I don't know of any R package that does what you want. Do you really need it, or are you asking for pedagogic purposes? In any case, see what Wikipedia has to say on this. And upvote.

    – Rui Barradas
    8 hours ago













  • It could help with my research, because I'm trying to find the differences between strings. Thanks for the link.

    – yaki
    8 hours ago











  • @RuiBarradas Not only do such packages exist, their existence is the major reason for R’s popularity today. :-)

    – Konrad Rudolph
    7 hours ago
























3 Answers
With adist(), you can retrieve the number of operations of each type:

drop(attr(adist("abc abc", "abcde acc", counts = TRUE), "counts"))

ins del sub
  2   0   1

From ?adist:

If counts is TRUE, the transformation counts are returned as the
"counts" attribute of this matrix, as a 3-dimensional array with
dimensions corresponding to the elements of x, the elements of y, and
the type of transformation (insertions, deletions and substitutions),
respectively.
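
As a small sketch of how these attributes can be inspected (the variable names are only illustrative), the counts can be pulled out by name and checked against the distance itself. The same call also attaches a "trafos" attribute, which encodes the alignment as a string of M, I, D and S codes:

```r
res <- adist("abc abc", "abcde acc", counts = TRUE)

# Named vector of operation counts: ins, del, sub
counts <- drop(attr(res, "counts"))

# With unit costs, the three counts add up to the Levenshtein distance
distance <- res[1, 1]
counts_match_distance <- sum(counts) == distance

# One code per alignment column:
# M = match, I = insert, D = delete, S = substitute
trafos <- attr(res, "trafos")[1, 1]
```

Indexing with [1, 1] is used here only to strip the matrix structure, since a single pair of strings is compared.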








edited 8 hours ago

answered 8 hours ago by tmfmnk













  • Thanks, it helps me a lot! Do you know if there is a function to directly get the characters corresponding to these operations? Otherwise, I could try to write a function using attr(adist("abda cc", "abc abc", counts = TRUE), "trafos") #= "MMSDMSIM" where M=match, S=substitute, D=delete, I=insert

    – yaki
    8 hours ago

  • I don't know of any handy function that will do it. However, I assume that playing around with trafos will lead you to the desired results.

    – tmfmnk
    8 hours ago



















Building off tmfmnk's answer and the suggestion to play around with the "trafos" attribute, here's a function that shows a table of all the characters inserted or substituted, and how many times each was inserted or substituted. If you set all_actions = TRUE, it also shows matches.

f <- function(x, y, all_actions = FALSE) {
  o <- adist(x, y, counts = TRUE)
  # Pair each character of y with its edit code (M, I or S).
  # This assumes the alignment has no deletions (D); with deletions
  # the codes no longer line up with the characters of y.
  cva <- list(
    char   = strsplit(y, '')[[1]],
    action = strsplit(attr(o, "trafos"), '')[[1]]
  )
  if (!all_actions)
    cva <- lapply(cva, '[', cva$action %in% c('I', 'S'))
  do.call(table, cva)
}

f(x = "abc abc", y = "abcde acc")
#     action
# char I S
#    c 0 1
#    d 1 0
#    e 1 0

f(x = "abc abc", y = "abcde acc", all_actions = TRUE)
#     action
# char I M S
#      0 1 0
#    a 0 2 0
#    b 0 1 0
#    c 0 2 1
#    d 1 0 0
#    e 1 0 0
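
The function above pairs the edit codes with the characters of y, which only lines up when nothing is deleted. A possible variant that also copes with deletions is sketched below: it walks the "trafos" string while keeping a separate position in each string. The helper name edit_ops and its output format are assumptions made for illustration, not part of any package.

```r
# Hypothetical helper: describe each edit with the characters involved
edit_ops <- function(x, y) {
  res <- adist(x, y, counts = TRUE)
  codes <- strsplit(attr(res, "trafos")[1, 1], "")[[1]]
  xs <- strsplit(x, "")[[1]]
  ys <- strsplit(y, "")[[1]]
  i <- 0L  # current position in x
  j <- 0L  # current position in y
  ops <- character(0)
  for (code in codes) {
    if (code %in% c("M", "S", "D")) i <- i + 1L  # consumes a character of x
    if (code %in% c("M", "S", "I")) j <- j + 1L  # consumes a character of y
    ops <- c(ops, switch(code,
      I = sprintf('insert "%s"', ys[j]),
      D = sprintf('delete "%s"', xs[i]),
      S = sprintf('substitute "%s" with "%s"', xs[i], ys[j]),
      M = NULL  # matches are not reported
    ))
  }
  ops
}

edit_ops("abc abc", "abcde acc")
edit_ops("abcd", "abc")
```

For the strings in the question this should list the two insertions and the one substitution; the second call exercises the deletion case that the table-based function does not handle.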






edited 5 hours ago

answered 7 hours ago by IceCreamToucan





































This is known as the Needleman–Wunsch algorithm. It calculates both the distance between two strings and the so-called traceback, which allows you to reconstruct the alignment.

Since this problem mostly crops up in biology when comparing biological sequences, this algorithm (and related ones) is implemented in the R package {Biostrings}, which is part of Bioconductor.

Since this package implements a more general solution than the simple Levenshtein distance, the usage is unfortunately more complex, and the usage vignette is correspondingly long. But the fundamental usage for your purposes is as follows:



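library(Biostrings)

# Identity substitution matrix: 1 for a match, 0 for a mismatch,
# covering lower-case letters and the space character
dist_mat = diag(27L)
colnames(dist_mat) = rownames(dist_mat) = c(letters, ' ')

result = pairwiseAlignment(
  "abc abc", "abcde acc",
  substitutionMatrix = dist_mat,
  gapOpening = 1, gapExtension = 1
)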


This won’t simply give you the list c('d', 'e', 'c'), though, because that list does not fully represent what actually happened here. Instead, it will return an alignment between the two strings. This can be represented as a sequence with substitutions and gaps:



score(result)
# [1] 3
aligned(result)
as.matrix(aligned(result))
#      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9]
# [1,] "a"  "b"  "c"  "-"  "-"  " "  "a"  "b"  "c"

aligned(result) provides, for each character in the second string, the corresponding character in the original string, with inserted characters replaced by -. Basically, this is a “recipe” for transforming the first string into the second string. Note that it will only contain insertions and substitutions, not deletions. To get these, you need to perform the alignment the other way round (i.e. swapping the string arguments).
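
To make the traceback idea concrete without Bioconductor, here is a minimal base R sketch of the dynamic-programming computation with unit costs, followed by a walk back through the table to recover one optimal edit script. The function name lev_traceback and the M/I/D/S codes (borrowed from adist's "trafos" notation) are choices made for this illustration; it is not the Biostrings implementation.

```r
# Levenshtein distance with traceback, unit costs, base R only
lev_traceback <- function(x, y) {
  xs <- strsplit(x, "")[[1]]
  ys <- strsplit(y, "")[[1]]
  n <- length(xs)
  m <- length(ys)
  # d[i + 1, j + 1] = distance between the first i chars of x and first j of y
  d <- matrix(0L, n + 1L, m + 1L)
  d[, 1] <- 0:n
  d[1, ] <- 0:m
  for (i in seq_len(n)) {
    for (j in seq_len(m)) {
      cost <- as.integer(xs[i] != ys[j])
      d[i + 1, j + 1] <- min(d[i, j + 1] + 1L,  # deletion
                             d[i + 1, j] + 1L,  # insertion
                             d[i, j] + cost)    # match or substitution
    }
  }
  # Walk back from the bottom-right corner, preferring the diagonal
  ops <- character(0)
  i <- n
  j <- m
  while (i > 0 || j > 0) {
    if (i > 0 && j > 0 && d[i + 1, j + 1] == d[i, j] + (xs[i] != ys[j])) {
      ops <- c(if (xs[i] == ys[j]) "M" else "S", ops)
      i <- i - 1L
      j <- j - 1L
    } else if (j > 0 && d[i + 1, j + 1] == d[i + 1, j] + 1L) {
      ops <- c("I", ops)
      j <- j - 1L
    } else {
      ops <- c("D", ops)
      i <- i - 1L
    }
  }
  list(distance = d[n + 1, m + 1], trafos = paste(ops, collapse = ""))
}

lev_traceback("abc abc", "abcde acc")
# $distance is 3 and $trafos is "MMMIIMMSM"
```

When several optimal alignments exist, the order of the checks in the loop decides which one is reported, so other tools may return a different but equally short script.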






  • Unfortunately the code above requires you to specify dist_mat manually such that it contains one row and column for each character that your string might contain. The code shown in this answer thus only allows lower-case letters and spaces, nothing else.

    – Konrad Rudolph
    6 hours ago
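
One way to avoid typing the alphabet by hand might be to derive it from the two strings themselves. The sketch below builds the same kind of identity matrix in base R; make_dist_mat is a hypothetical helper name, and whether pairwiseAlignment accepts every character that can end up in such a matrix has not been verified here.

```r
# Hypothetical helper: identity substitution matrix over the characters
# that actually occur in the two strings
make_dist_mat <- function(x, y) {
  chars <- sort(unique(strsplit(paste0(x, y), "")[[1]]))
  m <- diag(length(chars))
  dimnames(m) <- list(chars, chars)
  m
}

dist_mat <- make_dist_mat("abc abc", "abcde acc")
dim(dist_mat)
```

The result has one row and one column per distinct character, with 1 on the diagonal and 0 elsewhere, mirroring the diag(27L) construction in the answer.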
















            3














            This is known as the Needleman–Wunsch algorithm. It calculates both the distance between two strings as well as the so-called traceback, which allows you to reconstruct the alignment.



            Since this problem mostly crops up in biology when comparing biological sequences, this algorithm (and related ones) are implemented in the R package {Biostrings}, which is part of Bioconductor.



            Since this package implements are more general solution than the simple Levenshtein distance, the usage is unfortunately more complex, and the usage vignette is correspondingly long. But the fundamental usage for your purposes is as follows:



            library(Biostrings)

            dist_mat = diag(27L)
            colnames(dist_mat) = rownames(dist_mat) = c(letters, ' ')

            result = pairwiseAlignment(
            "abc abc", "abcde acc",
            substitutionMatrix = dist_mat,
            gapOpening = 1, gapExtension = 1
            )


            This won’t simply give you the list c('b', 'c', 'c'), though, because that list does not fully represent what actually happened here. Instead, it will return an alignment between the two strings. This can be represented as a sequence with substitutions and gaps:



            score(result)
            # [1] 3
            aligned(result)
            as.matrix(aligned(result))
            # [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9]
            # [1,] "a" "b" "c" "-" "-" " " "a" "b" "c"
            aligned(result)


            — For each character in the second string it provides the corresponding character in the original string, replacing inserted characters by -. Basically, this is a “recipe” for transforming the first string into the second string. Note that it will only contain insertions and substitutions, not deletions. To get these, you need to perform the alignment the other way round (i.e. swapping the string arguments).






            share|improve this answer
























            • Unfortunately the code above requires you to specify dist_mat manually such that it contains one row and column for each character that your string might contain. The code shown in this answer thus only allows lower-case letters and spaces, nothing else.

              – Konrad Rudolph
              6 hours ago
























            answered 6 hours ago









            Konrad Rudolph























