Match 4 columns and replace 1 in 2 filesMatching values within columnsCompare 2 delimited files and output...

How does the Moon's gravity affect Earth's oceans despite Earth's stronger gravitational pull?

What are the advantages of this gold finger shape?

What are these panels underneath the wing root of a A380?

How would armour (and combat) change if the fighter didn't need to actually wear it?

Setting up a Mathematical Institute of Refereeing?

What allows us to use imaginary numbers?

RAII wrapper for SQLite transactions

How do I answer an interview question about not meeting deadlines?

Escape Velocity - Won't the orbital path just become larger with higher initial velocity?

Is the Microsoft recommendation to use C# properties applicable to game development?

Why don't modern jet engines use forced exhaust mixing?

Insert or push_back to end of a std::vector?

What if a restaurant suddenly cannot accept credit cards, and the customer has no cash?

Does writing regular diary entries count as writing practice?

What is the spellcasting ability of a Barbarian Totem Warrior?

Visa on arrival to exit airport in Russia

Are there liquid fueled rocket boosters having coaxial fuel/oxidizer tanks?

Why is the battery jumpered to a resistor in this schematic?

What should we do with manuals from the 80s?

What would cause a nuclear power plant to break down after 2000 years, but not sooner?

Can anyone help me what's wrong here as i can prove 0 = 1?

100 Years of GCHQ - A quick afternoon puzzle!

Is there a word for returning to unpreparedness?

Is there a fallacy about "appeal to 'big words'"?



Match 4 columns and replace 1 in 2 files


Matching values within columnsCompare 2 delimited files and output differencesCompare two files and matched line send to new fileMerge and print matching and non matching values between a smaller file and a huge fileMatching two main columns at the same time between files, and paste supplementary columns into the output file when those main columns matchMatching 2 main columns between files; and paste other columns into the output file when those main columns match. Keep row size of 1st file intactMerging two files, one column at a timeExtract row if both column values appear in a single column from a separate fileMerging columns from 200+ big files into one tableReplacing matching entries in one column of a file by another column from a different file






.everyoneloves__top-leaderboard:empty,.everyoneloves__mid-leaderboard:empty,.everyoneloves__bot-mid-leaderboard:empty{ margin-bottom:0;
}







6















I have 2 files, and column 1 of file 1 must replace with column 2 of file 2, after column 2,3,4-5 or 5-4 (cross-match) of file 1 match with the column 1,4,5-6 or 6-5 of file 2.



file 1



SNP     Chr     Pos     EA      NEA     EAF     Beta    SE      Pvalue  Neff
1:79137 1 79137 A T 0.25 -0.026 0.0073 4.0e-04 231420
1:79033 1 79033 A G 0.0047 -0.038 0.056 4.9e-01 225429
1:118630 1 118630 C T 0.99 -0.033 0.055 5.5e-01 226311
1:533179 1 533179 A G 1 -0.098 0.19 6.1e-01 185906


file 2



1       1:79033_A_G     0       79033   A       G
1 1:79137_A_T 0 79137 T A
1 1:118630_C_T 0 118630 T C
1 1:533179_A_G 0 533179 G A


I need the output to look like this:



SNP     Chr     Pos     EA      NEA     EAF     Beta    SE      Pvalue  Neff
1:79137_A_T 1 79137 A T 0.25 -0.026 0.0073 4.0e-04 231420
1:79033_A_G 1 79033 A G 0.0047 -0.038 0.056 4.9e-01 225429
1:118630_C_T 1 118630 C T 0.99 -0.033 0.055 5.5e-01 226311
1:533179_A_G 1 533179 A G 1 -0.098 0.19 6.1e-01 185906


The files don't have the exact number of rows and the files are not tab-delimited. I tried the below code but it doesn't work, Can you correct my code?



awk 'NR==FNR{chr[$1]=$1;snp[$2]=$2;pos[$4]=$4;a1[$5]=$5;a2[$6]=$6;next} ($1 in chr)&&($4 in pos)&& ((($5 in a1) && ($6 in a2)) || (($6 in a1) && ($5 in a2))) {$2==snp[$2]}' file 2 file1









share|improve this question







New contributor



cookiemonster is a new contributor to this site. Take care in asking for clarification, commenting, and answering.
Check out our Code of Conduct.






















  • Do you need to preserve the exact amount of whitespace between columns, or is it OK to collapse each gap down to a single space (which awk tends to do)?

    – JigglyNaga
    yesterday











  • It is OK to collapse each gap down.

    – cookiemonster
    yesterday






  • 1





    Sounds like you might be interested in our sister site: Bioinformatics.

    – terdon
    yesterday


















6















I have 2 files, and column 1 of file 1 must replace with column 2 of file 2, after column 2,3,4-5 or 5-4 (cross-match) of file 1 match with the column 1,4,5-6 or 6-5 of file 2.



file 1



SNP     Chr     Pos     EA      NEA     EAF     Beta    SE      Pvalue  Neff
1:79137 1 79137 A T 0.25 -0.026 0.0073 4.0e-04 231420
1:79033 1 79033 A G 0.0047 -0.038 0.056 4.9e-01 225429
1:118630 1 118630 C T 0.99 -0.033 0.055 5.5e-01 226311
1:533179 1 533179 A G 1 -0.098 0.19 6.1e-01 185906


file 2



1       1:79033_A_G     0       79033   A       G
1 1:79137_A_T 0 79137 T A
1 1:118630_C_T 0 118630 T C
1 1:533179_A_G 0 533179 G A


I need the output to look like this:



SNP     Chr     Pos     EA      NEA     EAF     Beta    SE      Pvalue  Neff
1:79137_A_T 1 79137 A T 0.25 -0.026 0.0073 4.0e-04 231420
1:79033_A_G 1 79033 A G 0.0047 -0.038 0.056 4.9e-01 225429
1:118630_C_T 1 118630 C T 0.99 -0.033 0.055 5.5e-01 226311
1:533179_A_G 1 533179 A G 1 -0.098 0.19 6.1e-01 185906


The files don't have the exact number of rows and the files are not tab-delimited. I tried the below code but it doesn't work, Can you correct my code?



awk 'NR==FNR{chr[$1]=$1;snp[$2]=$2;pos[$4]=$4;a1[$5]=$5;a2[$6]=$6;next} ($1 in chr)&&($4 in pos)&& ((($5 in a1) && ($6 in a2)) || (($6 in a1) && ($5 in a2))) {$2==snp[$2]}' file 2 file1









share|improve this question







New contributor



cookiemonster is a new contributor to this site. Take care in asking for clarification, commenting, and answering.
Check out our Code of Conduct.






















  • Do you need to preserve the exact amount of whitespace between columns, or is it OK to collapse each gap down to a single space (which awk tends to do)?

    – JigglyNaga
    yesterday











  • It is OK to collapse each gap down.

    – cookiemonster
    yesterday






  • 1





    Sounds like you might be interested in our sister site: Bioinformatics.

    – terdon
    yesterday














6












6








6








I have 2 files, and column 1 of file 1 must replace with column 2 of file 2, after column 2,3,4-5 or 5-4 (cross-match) of file 1 match with the column 1,4,5-6 or 6-5 of file 2.



file 1



SNP     Chr     Pos     EA      NEA     EAF     Beta    SE      Pvalue  Neff
1:79137 1 79137 A T 0.25 -0.026 0.0073 4.0e-04 231420
1:79033 1 79033 A G 0.0047 -0.038 0.056 4.9e-01 225429
1:118630 1 118630 C T 0.99 -0.033 0.055 5.5e-01 226311
1:533179 1 533179 A G 1 -0.098 0.19 6.1e-01 185906


file 2



1       1:79033_A_G     0       79033   A       G
1 1:79137_A_T 0 79137 T A
1 1:118630_C_T 0 118630 T C
1 1:533179_A_G 0 533179 G A


I need the output to look like this:



SNP     Chr     Pos     EA      NEA     EAF     Beta    SE      Pvalue  Neff
1:79137_A_T 1 79137 A T 0.25 -0.026 0.0073 4.0e-04 231420
1:79033_A_G 1 79033 A G 0.0047 -0.038 0.056 4.9e-01 225429
1:118630_C_T 1 118630 C T 0.99 -0.033 0.055 5.5e-01 226311
1:533179_A_G 1 533179 A G 1 -0.098 0.19 6.1e-01 185906


The files don't have the exact number of rows and the files are not tab-delimited. I tried the below code but it doesn't work, Can you correct my code?



awk 'NR==FNR{chr[$1]=$1;snp[$2]=$2;pos[$4]=$4;a1[$5]=$5;a2[$6]=$6;next} ($1 in chr)&&($4 in pos)&& ((($5 in a1) && ($6 in a2)) || (($6 in a1) && ($5 in a2))) {$2==snp[$2]}' file 2 file1









share|improve this question







New contributor



cookiemonster is a new contributor to this site. Take care in asking for clarification, commenting, and answering.
Check out our Code of Conduct.











I have 2 files, and column 1 of file 1 must replace with column 2 of file 2, after column 2,3,4-5 or 5-4 (cross-match) of file 1 match with the column 1,4,5-6 or 6-5 of file 2.



file 1



SNP     Chr     Pos     EA      NEA     EAF     Beta    SE      Pvalue  Neff
1:79137 1 79137 A T 0.25 -0.026 0.0073 4.0e-04 231420
1:79033 1 79033 A G 0.0047 -0.038 0.056 4.9e-01 225429
1:118630 1 118630 C T 0.99 -0.033 0.055 5.5e-01 226311
1:533179 1 533179 A G 1 -0.098 0.19 6.1e-01 185906


file 2



1       1:79033_A_G     0       79033   A       G
1 1:79137_A_T 0 79137 T A
1 1:118630_C_T 0 118630 T C
1 1:533179_A_G 0 533179 G A


I need the output to look like this:



SNP     Chr     Pos     EA      NEA     EAF     Beta    SE      Pvalue  Neff
1:79137_A_T 1 79137 A T 0.25 -0.026 0.0073 4.0e-04 231420
1:79033_A_G 1 79033 A G 0.0047 -0.038 0.056 4.9e-01 225429
1:118630_C_T 1 118630 C T 0.99 -0.033 0.055 5.5e-01 226311
1:533179_A_G 1 533179 A G 1 -0.098 0.19 6.1e-01 185906


The files don't have the exact number of rows and the files are not tab-delimited. I tried the below code but it doesn't work, Can you correct my code?



awk 'NR==FNR{chr[$1]=$1;snp[$2]=$2;pos[$4]=$4;a1[$5]=$5;a2[$6]=$6;next} ($1 in chr)&&($4 in pos)&& ((($5 in a1) && ($6 in a2)) || (($6 in a1) && ($5 in a2))) {$2==snp[$2]}' file 2 file1






text-processing awk






share|improve this question







New contributor



cookiemonster is a new contributor to this site. Take care in asking for clarification, commenting, and answering.
Check out our Code of Conduct.










share|improve this question







New contributor



cookiemonster is a new contributor to this site. Take care in asking for clarification, commenting, and answering.
Check out our Code of Conduct.








share|improve this question




share|improve this question






New contributor



cookiemonster is a new contributor to this site. Take care in asking for clarification, commenting, and answering.
Check out our Code of Conduct.








asked yesterday









cookiemonstercookiemonster

311 bronze badge




311 bronze badge




New contributor



cookiemonster is a new contributor to this site. Take care in asking for clarification, commenting, and answering.
Check out our Code of Conduct.




New contributor




cookiemonster is a new contributor to this site. Take care in asking for clarification, commenting, and answering.
Check out our Code of Conduct.


















  • Do you need to preserve the exact amount of whitespace between columns, or is it OK to collapse each gap down to a single space (which awk tends to do)?

    – JigglyNaga
    yesterday











  • It is OK to collapse each gap down.

    – cookiemonster
    yesterday






  • 1





    Sounds like you might be interested in our sister site: Bioinformatics.

    – terdon
    yesterday



















  • Do you need to preserve the exact amount of whitespace between columns, or is it OK to collapse each gap down to a single space (which awk tends to do)?

    – JigglyNaga
    yesterday











  • It is OK to collapse each gap down.

    – cookiemonster
    yesterday






  • 1





    Sounds like you might be interested in our sister site: Bioinformatics.

    – terdon
    yesterday

















Do you need to preserve the exact amount of whitespace between columns, or is it OK to collapse each gap down to a single space (which awk tends to do)?

– JigglyNaga
yesterday





Do you need to preserve the exact amount of whitespace between columns, or is it OK to collapse each gap down to a single space (which awk tends to do)?

– JigglyNaga
yesterday













It is OK to collapse each gap down.

– cookiemonster
yesterday





It is OK to collapse each gap down.

– cookiemonster
yesterday




1




1





Sounds like you might be interested in our sister site: Bioinformatics.

– terdon
yesterday





Sounds like you might be interested in our sister site: Bioinformatics.

– terdon
yesterday










2 Answers
2






active

oldest

votes


















5














I'd do this in Perl because it has a sort function that lets us treat A T and T A as the same thing easily. For example:



$ perl -lane 'if(!$k){$name{join("",sort($F[3],$F[4],$F[5]))}=$F[1]; }else{$var=join("", sort($F[2],$F[3],$F[4])); $F[1]=$name{$var} if $name{$var};print join "t", @F; } $k++ if eof' file2 file1
SNP Chr Pos EA NEA EAF Beta SE Pvalue Neff
1:79137 1:79137_A_T 79137 A T 0.25 -0.026 0.0073 4.0e-04 231420
1:79033 1:79033_A_G 79033 A G 0.0047 -0.038 0.056 4.9e-01 225429
1:118630 1:118630_C_T 118630 C T 0.99 -0.033 0.055 5.5e-01 226311
1:533179 1:533179_A_G 533179 A G 1 -0.098 0.19 6.1e-01 185906


Or, slightly more legibly:



$ perl -lane 'if(!$k){
$name{join("",sort($F[3],$F[4],$F[5]))}=$F[1];
}
else{
$var=join("", sort($F[2],$F[3],$F[4]));
$F[1]=$name{$var} if $name{$var};
print join "t", @F;
}
$k++ if eof' file2 file1
SNP Chr Pos EA NEA EAF Beta SE Pvalue Neff
1:79137 1:79137_A_T 79137 A T 0.25 -0.026 0.0073 4.0e-04 231420
1:79033 1:79033_A_G 79033 A G 0.0047 -0.038 0.056 4.9e-01 225429
1:118630 1:118630_C_T 118630 C T 0.99 -0.033 0.055 5.5e-01 226311
1:533179 1:533179_A_G 533179 A G 1 -0.098 0.19 6.1e-01 185906


Explanation




  • perl -lane: the -a makes perl act like awk, automatically splitting its input into the array @F on whitespace. Since perl arrays start at 0, $F[0] will be the first field, $F[1] will be the second etc. Field N is $F[N-1]. The -n makes perl read its arguments as text files and apply the script given by -e to each line of them. The -l just removes trailing newlines from each input line and adds a newline to each print call.


  • $k++ if eof : this increments the the variable $k by 1 if we've reached the end of a file (eof). We can then use if(!$k) (if $k is not defined) as an equivalent to NR==FNR in awk.


  • if(!$k){$name{join("",sort($F[3],$F[4],$F[5]))}=$F[1];}: if this is the first file,file2, sort fields 4,5 and 6, join them into a string and use that string as the key in the hash (associative array)name. Then, save the variant's name from file2 as the value associated with that key. The sorting lets us treatA TandT A` as equivalent.


  • else{ : if we're now reading the second file, file1.


  • $var=join("", sort($F[2],$F[3],$F[4])); : build the key. This time using fields 3, 4 and 5.


  • $F[1]=$name{$var} if $name{$var}; : set the 2nd field to the value stored in the name hash, if there is a value for this key. The if is needed to make sure we don't change the header or any other variants that might be present in file1 but not in file2.


  • print join "t", @F; : print the fields, including the change just made above.






share|improve this answer



































    4














    The join command will do the work of joining up matching lines from multiple files. But it has some requirements on its input files, so you'll need to make some temporary files along the way, with a few extra fields.



    awk '{$1="";print $2" "$3" "$4" "$5 "%" $0 "%" NR }' < file1 | sort > 1.tmp
    awk '{print $1" "$4" "$5" "$6"%"$2} $5 != $6 {print $1" "$4" "$6" "$5"%"$2}' < file2 | sort > 2.tmp

    sed q file1
    join -t % -o 1.3 2.2 1.2 1.tmp 2.tmp | sort -t % -n | awk -F % '{print $2" "$3}'


    Step by step



    Preprocessing the first file:



    awk '{$1="";print $2" "$3" "$4" "$5 "%" $0 "%" NR }'


    Example output:



    1 118630 C T% 1 118630 C T 0.99 -0.033 0.055 5.5e-01 226311%4


    Those 3 fields, separated by %, are:




    • the "key" that has to be matched (input fields 2-5)

    • the original line, minus the first column (which is going to be replaced)

    • the original line number (so we can restore the file order after sort)


    This output is piped through sort and into a temporary file, because join requires its inputs to have been sorted.



    For the second file:



    awk '{print $1" "$4" "$5" "$6"%"$2} $5 != $6 {print $1" "$4" "$6" "$5"%"$2}'


    Example output:



    1 118630 C T%1:118630_C_T
    1 118630 T C%1:118630_C_T


    As you specified that fields 5 and 6 should match either way round, a second line is printed with them swapped (provided that they aren't identical). The %-separated fields here are




    • the "key" to be matched

    • column 2


    Again, the output is piped through sort and into another temporary file.



    The sed q copies the first line, ie. the headers, from file1. (Unlike all the awk steps, it doesn't collapse the whitespace, so the columns are even less well aligned than in the input.)



    Then comes the main "join" step:



    join -t % -o 1.3 2.2 1.2  1.tmp 2.tmp


    The -t % sets the separator to % (rather than whitespace). The -o argument produces the following three fields of output:




    • file 1, column 3: the line number

    • file 2, column 2: the original column 2 from file2

    • file 1, column 2: the rest of the line from file1


    Example output line:



    4%1:118630_C_T% 1 118630 C T 0.99 -0.033 0.055 5.5e-01 226311


    Then sort can restore the original file order (sort numerically, field separator %)



    sort -t % -n


    and a final awk can remove that line number and the %s



    awk -F %  '{print $2" "$3}'


    Final output line:



    1:118630_C_T  1 118630 C T 0.99 -0.033 0.055 5.5e-01 226311





    share|improve this answer






























      Your Answer








      StackExchange.ready(function() {
      var channelOptions = {
      tags: "".split(" "),
      id: "106"
      };
      initTagRenderer("".split(" "), "".split(" "), channelOptions);

      StackExchange.using("externalEditor", function() {
      // Have to fire editor after snippets, if snippets enabled
      if (StackExchange.settings.snippets.snippetsEnabled) {
      StackExchange.using("snippets", function() {
      createEditor();
      });
      }
      else {
      createEditor();
      }
      });

      function createEditor() {
      StackExchange.prepareEditor({
      heartbeatType: 'answer',
      autoActivateHeartbeat: false,
      convertImagesToLinks: false,
      noModals: true,
      showLowRepImageUploadWarning: true,
      reputationToPostImages: null,
      bindNavPrevention: true,
      postfix: "",
      imageUploader: {
      brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
      contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
      allowUrls: true
      },
      onDemand: true,
      discardSelector: ".discard-answer"
      ,immediatelyShowMarkdownHelp:true
      });


      }
      });






      cookiemonster is a new contributor. Be nice, and check out our Code of Conduct.










      draft saved

      draft discarded


















      StackExchange.ready(
      function () {
      StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2funix.stackexchange.com%2fquestions%2f535726%2fmatch-4-columns-and-replace-1-in-2-files%23new-answer', 'question_page');
      }
      );

      Post as a guest















      Required, but never shown

























      2 Answers
      2






      active

      oldest

      votes








      2 Answers
      2






      active

      oldest

      votes









      active

      oldest

      votes






      active

      oldest

      votes









      5














      I'd do this in Perl because it has a sort function that lets us treat A T and T A as the same thing easily. For example:



      $ perl -lane 'if(!$k){$name{join("",sort($F[3],$F[4],$F[5]))}=$F[1]; }else{$var=join("", sort($F[2],$F[3],$F[4])); $F[1]=$name{$var} if $name{$var};print join "t", @F; } $k++ if eof' file2 file1
      SNP Chr Pos EA NEA EAF Beta SE Pvalue Neff
      1:79137 1:79137_A_T 79137 A T 0.25 -0.026 0.0073 4.0e-04 231420
      1:79033 1:79033_A_G 79033 A G 0.0047 -0.038 0.056 4.9e-01 225429
      1:118630 1:118630_C_T 118630 C T 0.99 -0.033 0.055 5.5e-01 226311
      1:533179 1:533179_A_G 533179 A G 1 -0.098 0.19 6.1e-01 185906


      Or, slightly more legibly:



      $ perl -lane 'if(!$k){
      $name{join("",sort($F[3],$F[4],$F[5]))}=$F[1];
      }
      else{
      $var=join("", sort($F[2],$F[3],$F[4]));
      $F[1]=$name{$var} if $name{$var};
      print join "t", @F;
      }
      $k++ if eof' file2 file1
      SNP Chr Pos EA NEA EAF Beta SE Pvalue Neff
      1:79137 1:79137_A_T 79137 A T 0.25 -0.026 0.0073 4.0e-04 231420
      1:79033 1:79033_A_G 79033 A G 0.0047 -0.038 0.056 4.9e-01 225429
      1:118630 1:118630_C_T 118630 C T 0.99 -0.033 0.055 5.5e-01 226311
      1:533179 1:533179_A_G 533179 A G 1 -0.098 0.19 6.1e-01 185906


      Explanation




      • perl -lane: the -a makes perl act like awk, automatically splitting its input into the array @F on whitespace. Since perl arrays start at 0, $F[0] will be the first field, $F[1] will be the second etc. Field N is $F[N-1]. The -n makes perl read its arguments as text files and apply the script given by -e to each line of them. The -l just removes trailing newlines from each input line and adds a newline to each print call.


      • $k++ if eof : this increments the the variable $k by 1 if we've reached the end of a file (eof). We can then use if(!$k) (if $k is not defined) as an equivalent to NR==FNR in awk.


      • if(!$k){$name{join("",sort($F[3],$F[4],$F[5]))}=$F[1];}: if this is the first file,file2, sort fields 4,5 and 6, join them into a string and use that string as the key in the hash (associative array)name. Then, save the variant's name from file2 as the value associated with that key. The sorting lets us treatA TandT A` as equivalent.


      • else{ : if we're now reading the second file, file1.


      • $var=join("", sort($F[2],$F[3],$F[4])); : build the key. This time using fields 3, 4 and 5.


      • $F[1]=$name{$var} if $name{$var}; : set the 2nd field to the value stored in the name hash, if there is a value for this key. The if is needed to make sure we don't change the header or any other variants that might be present in file1 but not in file2.


      • print join "t", @F; : print the fields, including the change just made above.






      share|improve this answer
































        5














        I'd do this in Perl because it has a sort function that lets us treat A T and T A as the same thing easily. For example:



        $ perl -lane 'if(!$k){$name{join("",sort($F[3],$F[4],$F[5]))}=$F[1]; }else{$var=join("", sort($F[2],$F[3],$F[4])); $F[1]=$name{$var} if $name{$var};print join "t", @F; } $k++ if eof' file2 file1
        SNP Chr Pos EA NEA EAF Beta SE Pvalue Neff
        1:79137 1:79137_A_T 79137 A T 0.25 -0.026 0.0073 4.0e-04 231420
        1:79033 1:79033_A_G 79033 A G 0.0047 -0.038 0.056 4.9e-01 225429
        1:118630 1:118630_C_T 118630 C T 0.99 -0.033 0.055 5.5e-01 226311
        1:533179 1:533179_A_G 533179 A G 1 -0.098 0.19 6.1e-01 185906


        Or, slightly more legibly:



        $ perl -lane 'if(!$k){
        $name{join("",sort($F[3],$F[4],$F[5]))}=$F[1];
        }
        else{
        $var=join("", sort($F[2],$F[3],$F[4]));
        $F[1]=$name{$var} if $name{$var};
        print join "t", @F;
        }
        $k++ if eof' file2 file1
        SNP Chr Pos EA NEA EAF Beta SE Pvalue Neff
        1:79137 1:79137_A_T 79137 A T 0.25 -0.026 0.0073 4.0e-04 231420
        1:79033 1:79033_A_G 79033 A G 0.0047 -0.038 0.056 4.9e-01 225429
        1:118630 1:118630_C_T 118630 C T 0.99 -0.033 0.055 5.5e-01 226311
        1:533179 1:533179_A_G 533179 A G 1 -0.098 0.19 6.1e-01 185906


        Explanation




        • perl -lane: the -a makes perl act like awk, automatically splitting its input into the array @F on whitespace. Since perl arrays start at 0, $F[0] will be the first field, $F[1] will be the second etc. Field N is $F[N-1]. The -n makes perl read its arguments as text files and apply the script given by -e to each line of them. The -l just removes trailing newlines from each input line and adds a newline to each print call.


        • $k++ if eof : this increments the the variable $k by 1 if we've reached the end of a file (eof). We can then use if(!$k) (if $k is not defined) as an equivalent to NR==FNR in awk.


        • if(!$k){$name{join("",sort($F[3],$F[4],$F[5]))}=$F[1];}: if this is the first file,file2, sort fields 4,5 and 6, join them into a string and use that string as the key in the hash (associative array)name. Then, save the variant's name from file2 as the value associated with that key. The sorting lets us treatA TandT A` as equivalent.


        • else{ : if we're now reading the second file, file1.


        • $var=join("", sort($F[2],$F[3],$F[4])); : build the key. This time using fields 3, 4 and 5.


        • $F[1]=$name{$var} if $name{$var}; : set the 2nd field to the value stored in the name hash, if there is a value for this key. The if is needed to make sure we don't change the header or any other variants that might be present in file1 but not in file2.


        • print join "t", @F; : print the fields, including the change just made above.






        share|improve this answer






























          5












          5








          5







          I'd do this in Perl because it has a sort function that lets us treat A T and T A as the same thing easily. For example:



          $ perl -lane 'if(!$k){$name{join("",sort($F[3],$F[4],$F[5]))}=$F[1]; }else{$var=join("", sort($F[2],$F[3],$F[4])); $F[1]=$name{$var} if $name{$var};print join "t", @F; } $k++ if eof' file2 file1
          SNP Chr Pos EA NEA EAF Beta SE Pvalue Neff
          1:79137 1:79137_A_T 79137 A T 0.25 -0.026 0.0073 4.0e-04 231420
          1:79033 1:79033_A_G 79033 A G 0.0047 -0.038 0.056 4.9e-01 225429
          1:118630 1:118630_C_T 118630 C T 0.99 -0.033 0.055 5.5e-01 226311
          1:533179 1:533179_A_G 533179 A G 1 -0.098 0.19 6.1e-01 185906


          Or, slightly more legibly:



          $ perl -lane 'if(!$k){
          $name{join("",sort($F[3],$F[4],$F[5]))}=$F[1];
          }
          else{
          $var=join("", sort($F[2],$F[3],$F[4]));
          $F[1]=$name{$var} if $name{$var};
          print join "t", @F;
          }
          $k++ if eof' file2 file1
          SNP Chr Pos EA NEA EAF Beta SE Pvalue Neff
          1:79137 1:79137_A_T 79137 A T 0.25 -0.026 0.0073 4.0e-04 231420
          1:79033 1:79033_A_G 79033 A G 0.0047 -0.038 0.056 4.9e-01 225429
          1:118630 1:118630_C_T 118630 C T 0.99 -0.033 0.055 5.5e-01 226311
          1:533179 1:533179_A_G 533179 A G 1 -0.098 0.19 6.1e-01 185906


          Explanation




          • perl -lane: the -a makes perl act like awk, automatically splitting its input into the array @F on whitespace. Since perl arrays start at 0, $F[0] will be the first field, $F[1] will be the second etc. Field N is $F[N-1]. The -n makes perl read its arguments as text files and apply the script given by -e to each line of them. The -l just removes trailing newlines from each input line and adds a newline to each print call.


          • $k++ if eof : this increments the the variable $k by 1 if we've reached the end of a file (eof). We can then use if(!$k) (if $k is not defined) as an equivalent to NR==FNR in awk.


          • if(!$k){$name{join("",sort($F[3],$F[4],$F[5]))}=$F[1];}: if this is the first file,file2, sort fields 4,5 and 6, join them into a string and use that string as the key in the hash (associative array)name. Then, save the variant's name from file2 as the value associated with that key. The sorting lets us treatA TandT A` as equivalent.


          • else{ : if we're now reading the second file, file1.


          • $var=join("", sort($F[2],$F[3],$F[4])); : build the key. This time using fields 3, 4 and 5.


          • $F[1]=$name{$var} if $name{$var}; : set the 2nd field to the value stored in the name hash, if there is a value for this key. The if is needed to make sure we don't change the header or any other variants that might be present in file1 but not in file2.


          • print join "t", @F; : print the fields, including the change just made above.






          share|improve this answer















          I'd do this in Perl because it has a sort function that lets us treat A T and T A as the same thing easily. For example:



          $ perl -lane 'if(!$k){$name{join("",sort($F[3],$F[4],$F[5]))}=$F[1]; }else{$var=join("", sort($F[2],$F[3],$F[4])); $F[1]=$name{$var} if $name{$var};print join "t", @F; } $k++ if eof' file2 file1
          SNP Chr Pos EA NEA EAF Beta SE Pvalue Neff
          1:79137 1:79137_A_T 79137 A T 0.25 -0.026 0.0073 4.0e-04 231420
          1:79033 1:79033_A_G 79033 A G 0.0047 -0.038 0.056 4.9e-01 225429
          1:118630 1:118630_C_T 118630 C T 0.99 -0.033 0.055 5.5e-01 226311
          1:533179 1:533179_A_G 533179 A G 1 -0.098 0.19 6.1e-01 185906


          Or, slightly more legibly:



          $ perl -lane 'if(!$k){
          $name{join("",sort($F[3],$F[4],$F[5]))}=$F[1];
          }
          else{
          $var=join("", sort($F[2],$F[3],$F[4]));
          $F[1]=$name{$var} if $name{$var};
          print join "t", @F;
          }
          $k++ if eof' file2 file1
          SNP Chr Pos EA NEA EAF Beta SE Pvalue Neff
          1:79137 1:79137_A_T 79137 A T 0.25 -0.026 0.0073 4.0e-04 231420
          1:79033 1:79033_A_G 79033 A G 0.0047 -0.038 0.056 4.9e-01 225429
          1:118630 1:118630_C_T 118630 C T 0.99 -0.033 0.055 5.5e-01 226311
          1:533179 1:533179_A_G 533179 A G 1 -0.098 0.19 6.1e-01 185906


          Explanation




          • perl -lane: the -a makes perl act like awk, automatically splitting its input into the array @F on whitespace. Since perl arrays start at 0, $F[0] will be the first field, $F[1] will be the second etc. Field N is $F[N-1]. The -n makes perl read its arguments as text files and apply the script given by -e to each line of them. The -l just removes trailing newlines from each input line and adds a newline to each print call.


          • $k++ if eof : this increments the the variable $k by 1 if we've reached the end of a file (eof). We can then use if(!$k) (if $k is not defined) as an equivalent to NR==FNR in awk.


          • if(!$k){$name{join("",sort($F[3],$F[4],$F[5]))}=$F[1];}: if this is the first file,file2, sort fields 4,5 and 6, join them into a string and use that string as the key in the hash (associative array)name. Then, save the variant's name from file2 as the value associated with that key. The sorting lets us treatA TandT A` as equivalent.


          • else{ : if we're now reading the second file, file1.


          • $var=join("", sort($F[2],$F[3],$F[4])); : build the key. This time using fields 3, 4 and 5.


          • $F[1]=$name{$var} if $name{$var}; : set the 2nd field to the value stored in the name hash, if there is a value for this key. The if is needed to make sure we don't change the header or any other variants that might be present in file1 but not in file2.


          • print join "t", @F; : print the fields, including the change just made above.







          share|improve this answer














          share|improve this answer



          share|improve this answer








          edited yesterday

























          answered yesterday









          terdonterdon

          140k34 gold badges287 silver badges466 bronze badges




          140k34 gold badges287 silver badges466 bronze badges




























              4














              The join command will do the work of joining up matching lines from multiple files. But it has some requirements on its input files, so you'll need to make some temporary files along the way, with a few extra fields.



              awk '{$1="";print $2" "$3" "$4" "$5 "%" $0 "%" NR }' < file1 | sort > 1.tmp
              awk '{print $1" "$4" "$5" "$6"%"$2} $5 != $6 {print $1" "$4" "$6" "$5"%"$2}' < file2 | sort > 2.tmp

              sed q file1
              join -t % -o 1.3 2.2 1.2 1.tmp 2.tmp | sort -t % -n | awk -F % '{print $2" "$3}'


              Step by step



              Preprocessing the first file:



              awk '{$1="";print $2" "$3" "$4" "$5 "%" $0 "%" NR }'


              Example output:



              1 118630 C T% 1 118630 C T 0.99 -0.033 0.055 5.5e-01 226311%4


              Those 3 fields, separated by %, are:




              • the "key" that has to be matched (input fields 2-5)

              • the original line, minus the first column (which is going to be replaced)

              • the original line number (so we can restore the file order after sort)


              This output is piped through sort and into a temporary file, because join requires its inputs to have been sorted.



              For the second file:



              awk '{print $1" "$4" "$5" "$6"%"$2} $5 != $6 {print $1" "$4" "$6" "$5"%"$2}'


              Example output:



              1 118630 C T%1:118630_C_T
              1 118630 T C%1:118630_C_T


              As you specified that fields 5 and 6 should match either way round, a second line is printed with them swapped (provided that they aren't identical). The %-separated fields here are




              • the "key" to be matched

              • column 2


              Again, the output is piped through sort and into another temporary file.



              The sed q copies the first line, ie. the headers, from file1. (Unlike all the awk steps, it doesn't collapse the whitespace, so the columns are even less well aligned than in the input.)



              Then comes the main "join" step:



              join -t % -o 1.3 2.2 1.2  1.tmp 2.tmp


              The -t % sets the separator to % (rather than whitespace). The -o argument produces the following three fields of output:




              • file 1, column 3: the line number

              • file 2, column 2: the original column 2 from file2

              • file 1, column 2: the rest of the line from file1


              Example output line:



              4%1:118630_C_T% 1 118630 C T 0.99 -0.033 0.055 5.5e-01 226311


              Then sort can restore the original file order (sort numerically, field separator %)



              sort -t % -n


              and a final awk can remove that line number and the %s



              awk -F %  '{print $2" "$3}'


              Final output line:



              1:118630_C_T  1 118630 C T 0.99 -0.033 0.055 5.5e-01 226311





              share|improve this answer
































                4














                The join command will do the work of joining up matching lines from multiple files. But it has some requirements on its input files, so you'll need to make some temporary files along the way, with a few extra fields.



                awk '{$1="";print $2" "$3" "$4" "$5 "%" $0 "%" NR }' < file1 | sort > 1.tmp
                awk '{print $1" "$4" "$5" "$6"%"$2} $5 != $6 {print $1" "$4" "$6" "$5"%"$2}' < file2 | sort > 2.tmp

                sed q file1
                join -t % -o 1.3 2.2 1.2 1.tmp 2.tmp | sort -t % -n | awk -F % '{print $2" "$3}'


                Step by step



                Preprocessing the first file:



                awk '{$1="";print $2" "$3" "$4" "$5 "%" $0 "%" NR }'


                Example output:



                1 118630 C T% 1 118630 C T 0.99 -0.033 0.055 5.5e-01 226311%4


                Those 3 fields, separated by %, are:




                • the "key" that has to be matched (input fields 2-5)

                • the original line, minus the first column (which is going to be replaced)

                • the original line number (so we can restore the file order after sort)


                This output is piped through sort and into a temporary file, because join requires its inputs to have been sorted.



                For the second file:



                awk '{print $1" "$4" "$5" "$6"%"$2} $5 != $6 {print $1" "$4" "$6" "$5"%"$2}'


                Example output:



                1 118630 C T%1:118630_C_T
                1 118630 T C%1:118630_C_T


                As you specified that fields 5 and 6 should match either way round, a second line is printed with them swapped (provided that they aren't identical). The %-separated fields here are




                • the "key" to be matched

                • column 2


                Again, the output is piped through sort and into another temporary file.



                The sed q copies the first line, ie. the headers, from file1. (Unlike all the awk steps, it doesn't collapse the whitespace, so the columns are even less well aligned than in the input.)



                Then comes the main "join" step:



                join -t % -o 1.3 2.2 1.2  1.tmp 2.tmp


                The -t % sets the separator to % (rather than whitespace). The -o argument produces the following three fields of output:




                • file 1, column 3: the line number

                • file 2, column 2: the original column 2 from file2

                • file 1, column 2: the rest of the line from file1


                Example output line:



                4%1:118630_C_T% 1 118630 C T 0.99 -0.033 0.055 5.5e-01 226311


                Then sort can restore the original file order (sort numerically, field separator %)



                sort -t % -n


                and a final awk can remove that line number and the %s



                awk -F %  '{print $2" "$3}'


                Final output line:



                1:118630_C_T  1 118630 C T 0.99 -0.033 0.055 5.5e-01 226311





                share|improve this answer






























                  4












                  4








                  4







                  The join command will do the work of joining up matching lines from multiple files. But it has some requirements on its input files, so you'll need to make some temporary files along the way, with a few extra fields.



                  awk '{$1="";print $2" "$3" "$4" "$5 "%" $0 "%" NR }' < file1 | sort > 1.tmp
                  awk '{print $1" "$4" "$5" "$6"%"$2} $5 != $6 {print $1" "$4" "$6" "$5"%"$2}' < file2 | sort > 2.tmp

                  sed q file1
                  join -t % -o 1.3 2.2 1.2 1.tmp 2.tmp | sort -t % -n | awk -F % '{print $2" "$3}'


                  Step by step



                  Preprocessing the first file:



                  awk '{$1="";print $2" "$3" "$4" "$5 "%" $0 "%" NR }'


                  Example output:



                  1 118630 C T% 1 118630 C T 0.99 -0.033 0.055 5.5e-01 226311%4


                  Those 3 fields, separated by %, are:




                  • the "key" that has to be matched (input fields 2-5)

                  • the original line, minus the first column (which is going to be replaced)

                  • the original line number (so we can restore the file order after sort)


                  This output is piped through sort and into a temporary file, because join requires its inputs to have been sorted.



                  For the second file:



                  awk '{print $1" "$4" "$5" "$6"%"$2} $5 != $6 {print $1" "$4" "$6" "$5"%"$2}'


                  Example output:



                  1 118630 C T%1:118630_C_T
                  1 118630 T C%1:118630_C_T


                  As you specified that fields 5 and 6 should match either way round, a second line is printed with them swapped (provided that they aren't identical). The %-separated fields here are




                  • the "key" to be matched

                  • column 2


                  Again, the output is piped through sort and into another temporary file.



                  The sed q copies the first line, ie. the headers, from file1. (Unlike all the awk steps, it doesn't collapse the whitespace, so the columns are even less well aligned than in the input.)



                  Then comes the main "join" step:



                  join -t % -o 1.3 2.2 1.2  1.tmp 2.tmp


                  The -t % sets the separator to % (rather than whitespace). The -o argument produces the following three fields of output:




                  • file 1, column 3: the line number

                  • file 2, column 2: the original column 2 from file2

                  • file 1, column 2: the rest of the line from file1


                  Example output line:



                  4%1:118630_C_T% 1 118630 C T 0.99 -0.033 0.055 5.5e-01 226311


                  Then sort can restore the original file order (sort numerically, field separator %)



                  sort -t % -n


                  and a final awk can remove that line number and the %s



                  awk -F %  '{print $2" "$3}'


                  Final output line:



                  1:118630_C_T  1 118630 C T 0.99 -0.033 0.055 5.5e-01 226311





                  share|improve this answer















                  The join command will do the work of joining up matching lines from multiple files. But it has some requirements on its input files, so you'll need to make some temporary files along the way, with a few extra fields.



                  awk '{$1="";print $2" "$3" "$4" "$5 "%" $0 "%" NR }' < file1 | sort > 1.tmp
                  awk '{print $1" "$4" "$5" "$6"%"$2} $5 != $6 {print $1" "$4" "$6" "$5"%"$2}' < file2 | sort > 2.tmp

                  sed q file1
                  join -t % -o 1.3 2.2 1.2 1.tmp 2.tmp | sort -t % -n | awk -F % '{print $2" "$3}'


                  Step by step



                  Preprocessing the first file:



                  awk '{$1="";print $2" "$3" "$4" "$5 "%" $0 "%" NR }'


                  Example output:



                  1 118630 C T% 1 118630 C T 0.99 -0.033 0.055 5.5e-01 226311%4


                  Those 3 fields, separated by %, are:




                  • the "key" that has to be matched (input fields 2-5)

                  • the original line, minus the first column (which is going to be replaced)

                  • the original line number (so we can restore the file order after sort)


                  This output is piped through sort and into a temporary file, because join requires its inputs to have been sorted.



                  For the second file:



                  awk '{print $1" "$4" "$5" "$6"%"$2} $5 != $6 {print $1" "$4" "$6" "$5"%"$2}'


                  Example output:



                  1 118630 C T%1:118630_C_T
                  1 118630 T C%1:118630_C_T


                  As you specified that fields 5 and 6 should match either way round, a second line is printed with them swapped (provided that they aren't identical). The %-separated fields here are




                  • the "key" to be matched

                  • column 2


                  Again, the output is piped through sort and into another temporary file.



                  The sed q copies the first line, ie. the headers, from file1. (Unlike all the awk steps, it doesn't collapse the whitespace, so the columns are even less well aligned than in the input.)



                  Then comes the main "join" step:



                  join -t % -o 1.3 2.2 1.2  1.tmp 2.tmp


                  The -t % sets the separator to % (rather than whitespace). The -o argument produces the following three fields of output:




                  • file 1, column 3: the line number

                  • file 2, column 2: the original column 2 from file2

                  • file 1, column 2: the rest of the line from file1


                  Example output line:



                  4%1:118630_C_T% 1 118630 C T 0.99 -0.033 0.055 5.5e-01 226311


                  Then sort can restore the original file order (sort numerically, field separator %)



                  sort -t % -n


                  and a final awk can remove that line number and the %s



                  awk -F %  '{print $2" "$3}'


                  Final output line:



                  1:118630_C_T  1 118630 C T 0.99 -0.033 0.055 5.5e-01 226311






                  share|improve this answer














                  share|improve this answer



                  share|improve this answer








                  edited yesterday

























                  answered yesterday









                  JigglyNagaJigglyNaga

                  4,41210 silver badges37 bronze badges




                  4,41210 silver badges37 bronze badges

























                      cookiemonster is a new contributor. Be nice, and check out our Code of Conduct.










                      draft saved

                      draft discarded


















                      cookiemonster is a new contributor. Be nice, and check out our Code of Conduct.













                      cookiemonster is a new contributor. Be nice, and check out our Code of Conduct.












                      cookiemonster is a new contributor. Be nice, and check out our Code of Conduct.
















                      Thanks for contributing an answer to Unix & Linux Stack Exchange!


                      • Please be sure to answer the question. Provide details and share your research!

                      But avoid



                      • Asking for help, clarification, or responding to other answers.

                      • Making statements based on opinion; back them up with references or personal experience.


                      To learn more, see our tips on writing great answers.




                      draft saved


                      draft discarded














                      StackExchange.ready(
                      function () {
                      StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2funix.stackexchange.com%2fquestions%2f535726%2fmatch-4-columns-and-replace-1-in-2-files%23new-answer', 'question_page');
                      }
                      );

                      Post as a guest















                      Required, but never shown





















































                      Required, but never shown














                      Required, but never shown












                      Required, but never shown







                      Required, but never shown

































                      Required, but never shown














                      Required, but never shown












                      Required, but never shown







                      Required, but never shown







                      Popular posts from this blog

                      Taj Mahal Inhaltsverzeichnis Aufbau | Geschichte | 350-Jahr-Feier | Heutige Bedeutung | Siehe auch |...

                      Baia Sprie Cuprins Etimologie | Istorie | Demografie | Politică și administrație | Arii naturale...

                      Nicolae Petrescu-Găină Cuprins Biografie | Opera | In memoriam | Varia | Controverse, incertitudini...