Match 4 columns and replace 1 in 2 files
I have 2 files. Column 1 of file 1 must be replaced with column 2 of file 2 whenever columns 2, 3 and 4-5 (or 5-4, i.e. cross-matched) of file 1 match columns 1, 4 and 5-6 (or 6-5) of file 2.
file 1
SNP Chr Pos EA NEA EAF Beta SE Pvalue Neff
1:79137 1 79137 A T 0.25 -0.026 0.0073 4.0e-04 231420
1:79033 1 79033 A G 0.0047 -0.038 0.056 4.9e-01 225429
1:118630 1 118630 C T 0.99 -0.033 0.055 5.5e-01 226311
1:533179 1 533179 A G 1 -0.098 0.19 6.1e-01 185906
file 2
1 1:79033_A_G 0 79033 A G
1 1:79137_A_T 0 79137 T A
1 1:118630_C_T 0 118630 T C
1 1:533179_A_G 0 533179 G A
I need the output to look like this:
SNP Chr Pos EA NEA EAF Beta SE Pvalue Neff
1:79137_A_T 1 79137 A T 0.25 -0.026 0.0073 4.0e-04 231420
1:79033_A_G 1 79033 A G 0.0047 -0.038 0.056 4.9e-01 225429
1:118630_C_T 1 118630 C T 0.99 -0.033 0.055 5.5e-01 226311
1:533179_A_G 1 533179 A G 1 -0.098 0.19 6.1e-01 185906
The files don't have the same number of rows, and they are not tab-delimited. I tried the code below, but it doesn't work. Can you correct my code?
awk 'NR==FNR{chr[$1]=$1;snp[$2]=$2;pos[$4]=$4;a1[$5]=$5;a2[$6]=$6;next} ($1 in chr)&&($4 in pos)&& ((($5 in a1) && ($6 in a2)) || (($6 in a1) && ($5 in a2))) {$2==snp[$2]}' file 2 file1
text-processing awk
Do you need to preserve the exact amount of whitespace between columns, or is it OK to collapse each gap down to a single space (which awk tends to do)?
– JigglyNaga
yesterday
It is OK to collapse each gap down.
– cookiemonster
yesterday
Sounds like you might be interested in our sister site: Bioinformatics.
– terdon♦
yesterday
asked yesterday by cookiemonster (new contributor)
2 Answers
I'd do this in Perl because it has a sort function that lets us treat A T and T A as the same thing easily. For example:
$ perl -lane 'if(!$k){$name{join("",sort($F[3],$F[4],$F[5]))}=$F[1]; }else{$var=join("", sort($F[2],$F[3],$F[4])); $F[0]=$name{$var} if $name{$var};print join "\t", @F; } $k++ if eof' file2 file1
SNP Chr Pos EA NEA EAF Beta SE Pvalue Neff
1:79137_A_T 1 79137 A T 0.25 -0.026 0.0073 4.0e-04 231420
1:79033_A_G 1 79033 A G 0.0047 -0.038 0.056 4.9e-01 225429
1:118630_C_T 1 118630 C T 0.99 -0.033 0.055 5.5e-01 226311
1:533179_A_G 1 533179 A G 1 -0.098 0.19 6.1e-01 185906
Or, slightly more legibly:
$ perl -lane 'if(!$k){
    $name{join("",sort($F[3],$F[4],$F[5]))}=$F[1];
  }
  else{
    $var=join("", sort($F[2],$F[3],$F[4]));
    $F[0]=$name{$var} if $name{$var};
    print join "\t", @F;
  }
  $k++ if eof' file2 file1
SNP Chr Pos EA NEA EAF Beta SE Pvalue Neff
1:79137_A_T 1 79137 A T 0.25 -0.026 0.0073 4.0e-04 231420
1:79033_A_G 1 79033 A G 0.0047 -0.038 0.056 4.9e-01 225429
1:118630_C_T 1 118630 C T 0.99 -0.033 0.055 5.5e-01 226311
1:533179_A_G 1 533179 A G 1 -0.098 0.19 6.1e-01 185906
Explanation
- perl -lane: the -a makes perl act like awk, automatically splitting its input into the array @F on whitespace. Since perl arrays start at 0, $F[0] will be the first field, $F[1] will be the second, etc. Field N is $F[N-1]. The -n makes perl read its arguments as text files and apply the script given by -e to each line of them. The -l just removes trailing newlines from each input line and adds a newline to each print call.
- $k++ if eof: this increments the variable $k by 1 if we've reached the end of a file (eof). We can then use if(!$k) (true while $k is still unset) as an equivalent to NR==FNR in awk.
- if(!$k){$name{join("",sort($F[3],$F[4],$F[5]))}=$F[1];}: if this is the first file, file2, sort fields 4, 5 and 6, join them into a string and use that string as the key in the hash (associative array) %name. Then, save the variant's name from file2 as the value associated with that key. The sorting lets us treat A T and T A as equivalent. Note that the chromosome column is not part of this key.
- else{: if we're now reading the second file, file1.
- $var=join("", sort($F[2],$F[3],$F[4]));: build the key. This time using fields 3, 4 and 5.
- $F[0]=$name{$var} if $name{$var};: set the 1st field to the value stored in the name hash, if there is a value for this key. The if is needed to make sure we don't change the header or any other variants that might be present in file1 but not in file2.
- print join "\t", @F;: print the fields, tab-separated, including the change just made above.
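Since the question started from awk, the same lookup-table idea can be sketched there too. The attempt in the question cannot work as written: `$2==snp[$2]` is a comparison whose result is thrown away, nothing is ever printed, the five separate arrays do not tie values to the same row, and `file 2` is passed as two arguments. The version below is only a sketch under a few assumptions: the temporary directory and the `result` variable are demo scaffolding, the sample data is copied from the question, and the chromosome is included in the key.

```shell
# Sample data copied from the question; the temporary directory is demo scaffolding.
tmpdir=$(mktemp -d)
cat > "$tmpdir/file1" <<'EOF'
SNP Chr Pos EA NEA EAF Beta SE Pvalue Neff
1:79137 1 79137 A T 0.25 -0.026 0.0073 4.0e-04 231420
1:79033 1 79033 A G 0.0047 -0.038 0.056 4.9e-01 225429
1:118630 1 118630 C T 0.99 -0.033 0.055 5.5e-01 226311
1:533179 1 533179 A G 1 -0.098 0.19 6.1e-01 185906
EOF
cat > "$tmpdir/file2" <<'EOF'
1 1:79033_A_G 0 79033 A G
1 1:79137_A_T 0 79137 T A
1 1:118630_C_T 0 118630 T C
1 1:533179_A_G 0 533179 G A
EOF

# First pass (file2): remember the name under both allele orders.
# Second pass (file1): replace column 1 when chr, pos and alleles match.
result=$(awk '
    NR == FNR {
        name[$1, $4, $5, $6] = $2
        name[$1, $4, $6, $5] = $2
        next
    }
    ($2, $3, $4, $5) in name { $1 = name[$2, $3, $4, $5] }
    { print }
' "$tmpdir/file2" "$tmpdir/file1")

printf '%s\n' "$result"
rm -r "$tmpdir"
```

Rows of file1 with no counterpart in file2, including the header, are printed unchanged. Assigning to `$1` makes awk rebuild the line with single spaces, which the asker said is acceptable.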
add a comment |
The join command will do the work of joining up matching lines from two files. But it has some requirements on its input files, so you'll need to make some temporary files along the way, with a few extra fields.
awk '{$1="";print $2" "$3" "$4" "$5 "%" $0 "%" NR }' < file1 | sort > 1.tmp
awk '{print $1" "$4" "$5" "$6"%"$2} $5 != $6 {print $1" "$4" "$6" "$5"%"$2}' < file2 | sort > 2.tmp
sed q file1
join -t % -o 1.3 2.2 1.2 1.tmp 2.tmp | sort -t % -n | awk -F % '{print $2" "$3}'
Step by step
Preprocessing the first file:
awk '{$1="";print $2" "$3" "$4" "$5 "%" $0 "%" NR }'
Example output:
1 118630 C T% 1 118630 C T 0.99 -0.033 0.055 5.5e-01 226311%4
Those 3 fields, separated by %, are:
- the "key" that has to be matched (input fields 2-5)
- the original line, minus the first column (which is going to be replaced)
- the original line number (so we can restore the file order after sort)
This output is piped through sort and into a temporary file, because join requires its inputs to have been sorted.
For the second file:
awk '{print $1" "$4" "$5" "$6"%"$2} $5 != $6 {print $1" "$4" "$6" "$5"%"$2}'
Example output:
1 118630 C T%1:118630_C_T
1 118630 T C%1:118630_C_T
As you specified that fields 5 and 6 should match either way round, a second line is printed with them swapped (provided that they aren't identical). The %-separated fields here are:
- the "key" to be matched
- column 2
Again, the output is piped through sort and into another temporary file.
The sed q copies the first line, i.e. the headers, from file1. (Unlike all the awk steps, it doesn't collapse the whitespace, so the columns are even less well aligned than in the input.)
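To see the whitespace behaviour in isolation: awk rebuilds a line with its output field separator (a single space by default) only once a field has been assigned. The snippet below is a small illustration, not part of the pipeline; the sample line and the variable names are made up for the demo.

```shell
# A header-like line with uneven gaps (made up for this demo).
line='SNP    Chr   Pos'

# Printing without touching any field keeps the original spacing.
untouched=$(printf '%s\n' "$line" | awk '{ print }')

# Assigning to a field (even to itself) rebuilds $0 with single spaces.
rebuilt=$(printf '%s\n' "$line" | awk '{ $1 = $1; print }')

printf '[%s]\n[%s]\n' "$untouched" "$rebuilt"
```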
Then comes the main "join" step:
join -t % -o 1.3 2.2 1.2 1.tmp 2.tmp
The -t % sets the separator to % (rather than whitespace). The -o argument produces the following three fields of output:
- file 1, column 3: the line number
- file 2, column 2: the original column 2 from file2
- file 1, column 2: the rest of the line from file1
Example output line:
4%1:118630_C_T% 1 118630 C T 0.99 -0.033 0.055 5.5e-01 226311
Then sort can restore the original file order (sort numerically, field separator %)
sort -t % -n
and a final awk can remove that line number and the %s
awk -F % '{print $2" "$3}'
Final output line:
1:118630_C_T 1 118630 C T 0.99 -0.033 0.055 5.5e-01 226311
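For reference, here is one way the whole pipeline might be wrapped up so it can be run end to end on the sample data from the question. It is a sketch with a few deliberate deviations, all of them assumptions rather than part of the steps above: `LC_ALL=C` so that sort and join agree on collation, a `mktemp -d` directory instead of files in the current directory, `sort -t % -k 1,1` so that the inputs are ordered by the join key alone, the comma-separated form of the `-o` list, and `print $2 $3` in the last step (the third field already starts with a space).

```shell
export LC_ALL=C                 # assumption: byte-wise collation for sort and join
tmpdir=$(mktemp -d)             # demo scaffolding, removed at the end

cat > "$tmpdir/file1" <<'EOF'
SNP Chr Pos EA NEA EAF Beta SE Pvalue Neff
1:79137 1 79137 A T 0.25 -0.026 0.0073 4.0e-04 231420
1:79033 1 79033 A G 0.0047 -0.038 0.056 4.9e-01 225429
1:118630 1 118630 C T 0.99 -0.033 0.055 5.5e-01 226311
1:533179 1 533179 A G 1 -0.098 0.19 6.1e-01 185906
EOF
cat > "$tmpdir/file2" <<'EOF'
1 1:79033_A_G 0 79033 A G
1 1:79137_A_T 0 79137 T A
1 1:118630_C_T 0 118630 T C
1 1:533179_A_G 0 533179 G A
EOF

# key % rest-of-line % line-number, sorted on the key
awk '{$1=""; print $2" "$3" "$4" "$5 "%" $0 "%" NR}' "$tmpdir/file1" |
    sort -t % -k 1,1 > "$tmpdir/1.tmp"

# key % name, once per allele order, sorted on the key
awk '{print $1" "$4" "$5" "$6 "%" $2}
     $5 != $6 {print $1" "$4" "$6" "$5 "%" $2}' "$tmpdir/file2" |
    sort -t % -k 1,1 > "$tmpdir/2.tmp"

result=$(
    sed q "$tmpdir/file1"       # header line, copied as-is
    join -t % -o 1.3,2.2,1.2 "$tmpdir/1.tmp" "$tmpdir/2.tmp" |
        sort -t % -n |          # back to the original line order
        awk -F % '{print $2 $3}'
)

printf '%s\n' "$result"
rm -r "$tmpdir"
```

One behaviour to be aware of: rows of file1 that have no match in file2 are dropped by join in this form, so the output can be shorter than file1.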
Perl answer: answered yesterday by terdon (edited yesterday)
The join
command will do the work of joining up matching lines from multiple files. But it has some requirements on its input files, so you'll need to make some temporary files along the way, with a few extra fields.
awk '{$1="";print $2" "$3" "$4" "$5 "%" $0 "%" NR }' < file1 | sort > 1.tmp
awk '{print $1" "$4" "$5" "$6"%"$2} $5 != $6 {print $1" "$4" "$6" "$5"%"$2}' < file2 | sort > 2.tmp
sed q file1
join -t % -o 1.3 2.2 1.2 1.tmp 2.tmp | sort -t % -n | awk -F % '{print $2" "$3}'
Step by step
Preprocessing the first file:
awk '{$1="";print $2" "$3" "$4" "$5 "%" $0 "%" NR }'
Example output:
1 118630 C T% 1 118630 C T 0.99 -0.033 0.055 5.5e-01 226311%4
Those 3 fields, separated by %
, are:
- the "key" that has to be matched (input fields 2-5)
- the original line, minus the first column (which is going to be replaced)
- the original line number (so we can restore the file order after
sort
)
This output is piped through sort
and into a temporary file, because join
requires its inputs to have been sorted.
For the second file:
awk '{print $1" "$4" "$5" "$6"%"$2} $5 != $6 {print $1" "$4" "$6" "$5"%"$2}'
Example output:
1 118630 C T%1:118630_C_T
1 118630 T C%1:118630_C_T
As you specified that fields 5 and 6 should match either way round, a second line is printed with them swapped (provided that they aren't identical). The %
-separated fields here are
- the "key" to be matched
- column 2
Again, the output is piped through sort
and into another temporary file.
The sed q
copies the first line, ie. the headers, from file1
. (Unlike all the awk
steps, it doesn't collapse the whitespace, so the columns are even less well aligned than in the input.)
Then comes the main "join" step:
join -t % -o 1.3 2.2 1.2 1.tmp 2.tmp
The -t %
sets the separator to %
(rather than whitespace). The -o
argument produces the following three fields of output:
- file 1, column 3: the line number
- file 2, column 2: the original column 2 from
file2
- file 1, column 2: the rest of the line from
file1
Example output line:
4%1:118630_C_T% 1 118630 C T 0.99 -0.033 0.055 5.5e-01 226311
Then sort
can restore the original file order (sort numerically, field separator %
)
sort -t % -n
and a final awk
can remove that line number and the %
s
awk -F % '{print $2" "$3}'
Final output line:
1:118630_C_T 1 118630 C T 0.99 -0.033 0.055 5.5e-01 226311
add a comment |
The join
command will do the work of joining up matching lines from multiple files. But it has some requirements on its input files, so you'll need to make some temporary files along the way, with a few extra fields.
awk '{$1="";print $2" "$3" "$4" "$5 "%" $0 "%" NR }' < file1 | sort > 1.tmp
awk '{print $1" "$4" "$5" "$6"%"$2} $5 != $6 {print $1" "$4" "$6" "$5"%"$2}' < file2 | sort > 2.tmp
sed q file1
join -t % -o 1.3 2.2 1.2 1.tmp 2.tmp | sort -t % -n | awk -F % '{print $2" "$3}'
Step by step
Preprocessing the first file:
awk '{$1="";print $2" "$3" "$4" "$5 "%" $0 "%" NR }'
Example output:
1 118630 C T% 1 118630 C T 0.99 -0.033 0.055 5.5e-01 226311%4
Those 3 fields, separated by %
, are:
- the "key" that has to be matched (input fields 2-5)
- the original line, minus the first column (which is going to be replaced)
- the original line number (so we can restore the file order after
sort
)
This output is piped through sort
and into a temporary file, because join
requires its inputs to have been sorted.
For the second file:
awk '{print $1" "$4" "$5" "$6"%"$2} $5 != $6 {print $1" "$4" "$6" "$5"%"$2}'
Example output:
1 118630 C T%1:118630_C_T
1 118630 T C%1:118630_C_T
As you specified that fields 5 and 6 should match either way round, a second line is printed with them swapped (provided that they aren't identical). The %
-separated fields here are
- the "key" to be matched
- column 2
Again, the output is piped through sort
and into another temporary file.
The sed q
copies the first line, ie. the headers, from file1
. (Unlike all the awk
steps, it doesn't collapse the whitespace, so the columns are even less well aligned than in the input.)
Then comes the main "join" step:
join -t % -o 1.3 2.2 1.2 1.tmp 2.tmp
The -t %
sets the separator to %
(rather than whitespace). The -o
argument produces the following three fields of output:
- file 1, column 3: the line number
- file 2, column 2: the original column 2 from
file2
- file 1, column 2: the rest of the line from
file1
Example output line:
4%1:118630_C_T% 1 118630 C T 0.99 -0.033 0.055 5.5e-01 226311
Then sort
can restore the original file order (sort numerically, field separator %
)
sort -t % -n
and a final awk
can remove that line number and the %
s
awk -F % '{print $2" "$3}'
Final output line:
1:118630_C_T 1 118630 C T 0.99 -0.033 0.055 5.5e-01 226311
answered yesterday by JigglyNaga (edited yesterday)
Do you need to preserve the exact amount of whitespace between columns, or is it OK to collapse each gap down to a single space (which awk tends to do)? – JigglyNaga yesterday
It is OK to collapse each gap down. – cookiemonster yesterday
Sounds like you might be interested in our sister site: Bioinformatics. – terdon♦ yesterday