How to merge two files and skip duplicate data


I have two large files (each more than 300,000 lines) that I want to combine in a specific way. Some rows in the two files measure the same thing: when columns 9, 14, 15, 16, and 17 of a row in File1 equal the same columns of a row in File2, I treat the two rows as duplicates. In that case I want to output the row from File1 and skip the row from File2. Otherwise, I want to output all rows from both files. The other columns of duplicate rows may differ between the files; File1 is more precise than File2, which is why I keep the File1 row.



For example, columns 9, 14, 15, 16, and 17 of the first three lines of the two files below are equal, so those lines measure the same thing and I want to output only the File1 rows and skip the File2 rows. For the fourth lines, column 14 differs between the two files, so I output both lines.



I used this script:



awk '!seen[$9,$14,$15,$16,$17]++' File1 File2 > output


It works well for small data. However, with the large data sets it skips some rows that are not duplicates, and I don't know why. I would appreciate any help with a script for merging these data sets.
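One possible cause worth checking (an assumption, since the full files are not shown) is that `!seen[...]++` also drops rows whose key repeats inside a single file, not only File2 rows that match a File1 row. The sketch below counts such within-file repeats. The file names `sample1.txt` and `sample2.txt` and their contents are made up for illustration; only fields 9 and 14 to 17 matter.

```shell
# Hypothetical sample data: the two rows of sample2.txt share the same key
# (fields 9, 14, 15, 16, 17), and neither matches the row in sample1.txt.
printf '%s\n' \
  'a b c d e f g h 34.40 j k l m 1985 1 1 SMO1' > sample1.txt
printf '%s\n' \
  'a b c d e f g h 70.80 j k l m 1990 1 1 GYA' \
  'x y z d e f g h 70.80 j k l m 1990 1 1 GYA' > sample2.txt

# For each file, count how many distinct keys occur more than once.
for f in sample1.txt sample2.txt; do
  awk -v name="$f" '
    { count[$9,$14,$15,$16,$17]++ }
    END {
      for (k in count) if (count[k] > 1) repeated++
      printf "%s: %d repeated key(s)\n", name, repeated
    }' "$f"
done > repeats.txt

cat repeats.txt
```

A nonzero count for either real file would mean the original one-liner removes rows that are not File1/File2 duplicates.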



File1



28  208   48  198  1110   2.04   33 0.0   34.40 3.3 0.0 0   8.0 1985  1  1 SMO1   -9 -9 -9  
24 102 26 99 2100 2.61 129 0.0 42.90 3.3 0.0 0 8.0 1985 1 1 EYA -9 -9 -9
89 294 26 106 1162 4.54 -115 0.0 70.80 3.3 0.0 0 8.0 1985 1 1 GYA -9 -9 -9
38 88 41 86 1100 3.50 155 0.0 56.30 3.8 0.0 0 10.0 1985 1 1 KOL1 -9 -9 -9


File2



24  102   22  100  1110   2.04   33 0.0   34.40 3.3 0.0 0   8.0 1985  1  1 SMO1   -9 -9 -9  
24 102 26 99 2100 2.61 129 0.0 42.90 3.3 0.0 0 8.0 1985 1 1 EYA -9 -9 -9
24 102 26 106 1162 4.54 -115 0.0 70.80 3.3 0.0 0 8.0 1985 1 1 GYA -9 -9 -9
38 88 41 86 1100 3.50 155 0.0 56.30 3.8 0.0 0 10.0 2000 1 1 KOL1 -9 -9 -9


output



28  208   48  198  1110   2.04   33 0.0   34.40 3.3 0.0 0   8.0 1985  1  1 SMO1   -9 -9 -9  
24 102 26 99 2100 2.61 129 0.0 42.90 3.3 0.0 0 8.0 1985 1 1 EYA -9 -9 -9
89 294 26 106 1162 4.54 -115 0.0 70.80 3.3 0.0 0 8.0 1985 1 1 GYA -9 -9 -9
38 88 41 86 1100 3.50 155 0.0 56.30 3.8 0.0 0 10.0 1985 1 1 KOL1 -9 -9 -9
38 88 41 86 1100 3.50 155 0.0 56.30 3.8 0.0 0 10.0 2000 1 1 KOL1 -9 -9 -9
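
A minimal sketch of one way to get this output, assuming the goal is to drop a File2 row only when its key already occurs in File1: read File1 first, record its keys and print every row, then print only those File2 rows whose key was not recorded. The block recreates the sample files above in the current directory so it can be run as-is.

```shell
# Recreate the sample files from the question (overwrites File1/File2 here).
cat > File1 <<'EOF'
28  208   48  198  1110   2.04   33 0.0   34.40 3.3 0.0 0   8.0 1985  1  1 SMO1   -9 -9 -9
24 102 26 99 2100 2.61 129 0.0 42.90 3.3 0.0 0 8.0 1985 1 1 EYA -9 -9 -9
89 294 26 106 1162 4.54 -115 0.0 70.80 3.3 0.0 0 8.0 1985 1 1 GYA -9 -9 -9
38 88 41 86 1100 3.50 155 0.0 56.30 3.8 0.0 0 10.0 1985 1 1 KOL1 -9 -9 -9
EOF
cat > File2 <<'EOF'
24  102   22  100  1110   2.04   33 0.0   34.40 3.3 0.0 0   8.0 1985  1  1 SMO1   -9 -9 -9
24 102 26 99 2100 2.61 129 0.0 42.90 3.3 0.0 0 8.0 1985 1 1 EYA -9 -9 -9
24 102 26 106 1162 4.54 -115 0.0 70.80 3.3 0.0 0 8.0 1985 1 1 GYA -9 -9 -9
38 88 41 86 1100 3.50 155 0.0 56.30 3.8 0.0 0 10.0 2000 1 1 KOL1 -9 -9 -9
EOF

awk '
  # First file: remember each key and print every row unchanged.
  NR == FNR { seen[$9,$14,$15,$16,$17]; print; next }
  # Second file: print only rows whose key never appeared in File1.
  !(($9,$14,$15,$16,$17) in seen)
' File1 File2 > output

cat output
```

Unlike `!seen[...]++`, this keeps rows whose key repeats within File1 or within File2. Keys are compared as strings, so `34.40` and `34.4` would count as different values. The `NR == FNR` test assumes File1 is not empty.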









      shell-script text-processing awk gawk join






edited 7 mins ago by Freddy

asked 58 mins ago by Esi
























