conditional replacing rows with a number
I have a directory containing nearly 11 million small files, named like this:
wa_filtering_DP15_good_pops_snps_file_1
wa_filtering_DP15_good_pops_snps_file_2
.
.
.
wa_filtering_DP15_good_pops_snps_file_11232111
Each file has only 2 rows and 315 columns, and looks like this:
1 0 0 0 0 0 0 0 0 0 1 2 1
0 0 0 0 0 0 0 0 0 0 0 0 0
I want to go through each file and, for every column in which both rows are 0, replace both values with 9, to get something like this:
1 9 9 9 9 9 9 9 9 9 1 2 1
0 9 9 9 9 9 9 9 9 9 0 0 0
Can someone help me figure out how to do that?
Thanks
text-processing
edited Sep 20 '17 at 18:03
Jeff Schaller♦
asked Sep 20 '17 at 17:29
Anna1364
With millions of small files, you're at risk of running out of inodes. Check with `df /path/to/files` versus `df -i /path/to/files`
– glenn jackman
Sep 20 '17 at 19:33
I have a suspicion you would be better off rearchitecting, perhaps just to set up a database, but there's not enough information here to diagnose the real situation. ;) Good luck.
– Wildcard
Sep 20 '17 at 21:08
6 Answers
Here is an awk solution.
awk '{split($0,ary1,/[ ]+/); getline x; split(x,ary2,/[ ]+/);
for (i in ary1)if (!(ary1[i]+ary2[i])){ary1[i]=ary2[i]=9}}
END{for (r=1;r<=NF;r++) printf ("%d ", ary1[r]); printf "\n";
for (z=1;z<=NF;z++) printf ("%d ", ary2[z]); printf "\n"}' infile
Explanations:
- split($0,ary1,/[ ]+/); reads the first line and splits it into the array ary1, using one or more spaces as the delimiter.
- getline x; split(x,ary2,/[ ]+/); reads the second line into the variable x and splits it into the array ary2.
- for (i in ary1)if (!(ary1[i]+ary2[i])){ary1[i]=ary2[i]=9}} loops over every index i of ary1; if the sum of the two values at that index is zero (!(0) evaluates to 1, so the if condition is true), both values are set to 9.
- for (r=1;r<=NF;r++) printf ("%d ", ary1[r]); printf "\n"; prints the final values of ary1, and the following loop prints ary2 on the next line.
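To apply this to all ~11 million files, write each result to FILENAME.out, where FILENAME is the name of the input file awk is currently reading. The output has to be written in the main block (an END block runs only once, after the last file), each output file is closed as soon as it is complete, and find passes the names in batches because 11 million names do not fit on a single command line.
find . -maxdepth 1 -type f -name 'wa_filtering_DP15_good_pops_snps_file_*' ! -name '*.out' -exec awk '
{split($0,ary1,/[ ]+/); getline x; split(x,ary2,/[ ]+/); out=FILENAME".out";
for (i in ary1)if (!(ary1[i]+ary2[i])){ary1[i]=ary2[i]=9}
for (r=1;r<=NF;r++) printf ("%d ", ary1[r])>out; printf "\n">out;
for (z=1;z<=NF;z++) printf ("%d ", ary2[z])>out; printf "\n">out;
close(out)}' {} +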
If any number can be negative, then the sum of two numbers may be zero without any of them being zero...
– Kusalananda♦
Sep 20 '17 at 18:24
The OP has not mentioned this yet. I will update once they confirm whether there are negative values; it is simply a matter of changing the condition to ary1[i]==0 && ary2[i]==0 :)
– αғsнιη
Sep 20 '17 at 18:25
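The explicit comparison suggested in the comment above could look roughly like the following minimal sketch. It assumes a two-line, whitespace-separated file; the function name zero_pairs_to_nine is made up for illustration and does not come from any of the answers.

```shell
#!/bin/sh
# Hypothetical helper (the name is an assumption, not from the answers):
# rewrite columns where BOTH rows are exactly 0 to 9, using explicit
# comparisons instead of a sum, so a pair such as -1/1 is left untouched.
zero_pairs_to_nine() {
    awk '
        FNR == 1 { split($0, top, " "); next }   # remember the first row
        FNR == 2 {
            l1 = l2 = ""
            for (i = 1; i <= NF; i++) {
                a = top[i]; b = $i
                if (a == 0 && b == 0) { a = 9; b = 9 }
                l1 = l1 (i > 1 ? OFS : "") a
                l2 = l2 (i > 1 ? OFS : "") b
            }
            print l1; print l2
        }
    ' "$1"
}
```

Called as zero_pairs_to_nine file, it prints the two rewritten rows to standard output and leaves the input file untouched.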
For kicks, here's Ruby
ruby -e '
data = File.readlines(ARGV.shift)
.map {|line| line.split.map(&:to_i)}
.transpose
.map {|(a,b)| (a==0 && b==0) ? [9,9] : [a,b]}
.transpose
.each {|row| puts row.join(" ")}
' file
1 9 9 9 9 9 9 9 9 9 1 2 1
0 9 9 9 9 9 9 9 9 9 0 0 0
To replace all the files:
ruby -e '
require "tempfile"
require "pathname"
Pathname.new("/path/to/your/files/").each_child do |pathname|
next unless pathname.file?
temp = Tempfile.new(pathname.basename.to_s)
filename = pathname.to_s
File.readlines(filename)
.map {|line| line.split.map(&:to_i)}
.transpose
.map {|(a,b)| (a==0 && b==0) ? [9,9] : [a,b]}
.transpose
.each {|row| temp.puts row.join(" ")}
temp.close
File.link filename, filename+".bak"
File.rename temp.path, filename
end
'
You may not want the File.link step if you're running out of inodes.
– glenn jackman
Sep 20 '17 at 19:34
This is an alternative approach, which might be slow for millions of files compared to the pure awk solutions.
Using something like this, you can transpose rows to columns:
$ cat file1
1 0 0 0 0 0 0 0 0 0 1 2 1
0 0 0 0 0 0 0 0 0 0 0 0 0
$ paste -d'-' <(head -n1 file1 |tr -s ' ' '\n') <(tail -n1 file1 |tr -s ' ' '\n')
1-0
0-0
0-0
0-0
0-0
0-0
0-0
0-0
0-0
0-0
1-0
2-0
1-0
You can then replace every 0-0 line with 9-9 using a simple sed (anchored with ^ and $ so that a pair such as 10-0 is left alone), and store the output in a temporary variable:
$ f1=$(sed 's/^0-0$/9-9/' <(paste -d'-' <(head -n1 file1|tr -s ' ' '\n') <(tail -n1 file1 |tr -s ' ' '\n')))
$ echo "$f1"
1-0
9-9
9-9
9-9
9-9
9-9
9-9
9-9
9-9
9-9
1-0
2-0
1-0
You can now revert back from columns to rows like:
$ awk -F'-' 'NR==FNR{printf "%s ",$1;p=1;next}p{printf "\n";p=0}{printf "%s ",$2}END{printf "\n"}' <(echo "$f1") <(echo "$f1")
1 9 9 9 9 9 9 9 9 9 1 2 1
0 9 9 9 9 9 9 9 9 9 0 0 0
You can also append >file1 to the end of the last awk command to overwrite file1 with the new contents.
The only thing left is to loop over all the files, for example with a bash loop:
for f in ./wa_filtering_DP15_good_pops_snps_file_*;do
f1=$(sed 's/^0-0$/9-9/' <(paste -d'-' <(head -n1 "$f"|tr -s ' ' '\n') <(tail -n1 "$f" |tr -s ' ' '\n')))
awk -F'-' 'NR==FNR{printf "%s ",$1;p=1;next}p{printf "\n";p=0}{printf "%s ",$2}END{printf "\n"}' <(echo "$f1") <(echo "$f1") #>"$f" #uncomment >"$f" to overwrite the files...
done
With awk:
NR == 1 {   # save the values from 1st line in array t
    split($0, t, FS);
}
NR == 2 {   # compare values from second line with those stored in array t
    for ( i = 1; i <= NF; ++i ) {
        # build l1 and l2 (line 1 and line 2) based on comparison
        if ($i == 0 && t[i] == 0) {
            l1 = (i == 1 ? 9 : l1 OFS 9 );
            l2 = (i == 1 ? 9 : l2 OFS 9 );
        } else {
            l1 = (i == 1 ? t[i] : l1 OFS t[i] );
            l2 = (i == 1 ? $i : l2 OFS $i );
        }
    }
}
END {   # output the two constructed lines
    print l1;
    print l2;
}
Running it on the example file:
$ awk -f script.awk file
1 9 9 9 9 9 9 9 9 9 1 2 1
0 9 9 9 9 9 9 9 9 9 0 0 0
Running on all files matching wa_filtering_DP15_good_pops_snps_file_* in current directory:
mkdir modified
for name in wa_filtering_DP15_good_pops_snps_file_*; do
awk -f script.awk "$name" >"modified/$name.new"
done
This will create a new file for each input file, with the name of the original file and an extra .new suffix. The new files will be placed in the modified folder in the current directory.
- I opted for creating new files so that the originals are left unmodified.
- I opted to put the new files in a new directory, as having 22 million files in a single directory could make the filesystem be a bit awkward to work with.
In general, try not to create millions of files in a single directory. Instead either
- create many subdirectories and distribute the files in them, maybe based on a binning algorithm working on that last integer of the filename, or a hash, or
- create a single output file that aggregates all data, possibly with extra lines of text identifying what the following two lines refer to.
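As a rough illustration of the first suggestion, the sketch below bins files into subdirectories using the trailing integer of the file name. The bucket count (1000), the bin_ prefix and both function names are assumptions chosen for the example; it also assumes the trailing number has no leading zeros and that it is run from inside the directory holding the files.

```shell
#!/bin/sh
# Hypothetical helper: map a file name ending in _<integer> to a bucket
# directory such as "bin_042". Bucket count and prefix are assumptions.
bucket_for() {
    n=${1##*_}                       # trailing integer, e.g. 11232111
    printf 'bin_%03d\n' $(( n % 1000 ))
}

# Move one file into its bucket, creating the directory on demand.
move_to_bucket() {
    dir=$(bucket_for "$1")
    mkdir -p "$dir" && mv -- "$1" "$dir/"
}
```

With 1000 buckets, 11 million files end up at roughly 11,000 files per directory, which most tools handle far more comfortably than a single directory of millions.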
The following variant runs more efficiently on millions of files:
FNR == 1 {  # save the values from 1st line in array t
    split($0, t, FS);
}
FNR == 2 {  # compare values from second line with those stored in array t
    for ( i = 1; i <= NF; ++i ) {
        # build l1 and l2 (line 1 and line 2) based on comparison
        if ($i == 0 && t[i] == 0) {
            l1 = (i == 1 ? 9 : l1 OFS 9 );
            l2 = (i == 1 ? 9 : l2 OFS 9 );
        } else {
            l1 = (i == 1 ? t[i] : l1 OFS t[i] );
            l2 = (i == 1 ? $i : l2 OFS $i );
        }
    }
    # create output filename based on input filename
    # and output the two lines
    f = "modified/" FILENAME ".new";
    print l1 >f;
    print l2 >f;
    # close the file so that awk does not run out of open file descriptors
    close(f);
}
To run it:
mkdir modified
find . -maxdepth 1 -type f -name 'wa_filtering_DP15_good_pops_snps_file_*' \
    -exec awk -f script.awk {} +
The new files will be generated in the modified folder as before, but this time far fewer awk processes are started (each one handles a large batch of files), so processing is much faster.
I simplified your for loop body to this: ` up = t[i]; low = $i; if (up == 0 && low == 0) { up = 9; low = 9; } if(i != 1) { up = OFS up; low = OFS low; } l1 = l1 up; l2 = l2 low; `. Clearer and cleaner, in my opinion. Plus, one if test less.
– MiniMax
Sep 21 '17 at 12:57
@MiniMax That may be a good modification, I agree. I will fix it as soon as I have spare time on my hands (busy ATM).
– Kusalananda♦
Sep 21 '17 at 13:00
First variant:
For single file:
datamash -W transpose < input.txt | sed 's/^0\t0$/9\t9/' | datamash transpose
For many files, do the same in a loop:
for i in *; do datamash -W transpose < "$i" |
sed 's/^0\t0$/9\t9/' |
datamash transpose > "new_$i"; done
This loop creates a changed copy of each file, with the prefix "new_" added to its name. You can then remove the old files and strip the "new_" prefix from the file names.
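The final clean-up step could be sketched as follows. This is only an illustration: the function name strip_new_prefix is hypothetical, and it assumes every "new_" file should replace the original of the same name.

```shell
#!/bin/sh
# Hypothetical clean-up step: rename every "new_<name>" in the given directory
# to "<name>", replacing the old file of that name.
strip_new_prefix() {
    for f in "$1"/new_*; do
        [ -e "$f" ] || continue        # the glob matched nothing
        base=${f##*/}                  # file name without the directory part
        mv -- "$f" "$1/${base#new_}"
    done
}
```

Because mv replaces the old file directly, a separate pass to delete the originals is not needed in this variant.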
Second variant:
This is a solution for a single file; for multiple files, use a loop as in the previous variant.
tr '\n' '\t' < input.txt |
awk '{
    num = NF / 2;
    for(up = 1; up <= NF; up++) {
        if(up <= num) {
            low = num + up;
            if(!$up && !$low) {
                $up = 9;
                $low = 9;
            }
        }
        printf "%s\t", $up;
        if(up % num == 0)
            print "";
    }
}'
Explanation
tr '\n' '\t' < input.txt - joins the two lines together.
awk
- checks each element of the first line together with the corresponding element of the second line, e.g. 1 and 316, 2 and 317, 3 and 318, and so on.
- if both elements are 0, it changes them to 9.
- prints the fields in order: 1, 2, 3, 4 ... 628, 629, 630.
- each time the field number is a multiple of the number of elements per line, adds a newline.
Input
1 0 0 0 0 0 0 0 0 0 1 2 1
0 0 0 0 0 0 0 0 0 0 0 0 0
Output
1 9 9 9 9 9 9 9 9 9 1 2 1
0 9 9 9 9 9 9 9 9 9 0 0 0
'for i in *' is probably not a valid solution for 11 million files. Use xargs instead.
– RonJohn
Sep 21 '17 at 0:59
@RonJohn * will only be a problem if the shell uses it to execute an external command. for i in * will not be a problem on 11 million files in itself.
– Kusalananda♦
Sep 21 '17 at 6:15
But the '*' will expand in the bash command buffer, certainly overflowing it.
– RonJohn
Sep 22 '17 at 15:40
@RonJohn Here is the answer to your question.
– MiniMax
Sep 22 '17 at 17:22
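The distinction discussed in these comments can be sketched as below: a glob expanded by a shell for loop never becomes an argument list for an external command, while find with -exec ... {} + splits the names into batches when an external command has to receive them. Both function names are hypothetical and only count files for the sake of the example.

```shell
#!/bin/sh
# Count matching files with a plain shell loop. The glob is expanded inside
# the shell itself, so no argument list is handed to an external command.
count_with_loop() {
    n=0
    for f in "$1"/wa_filtering_DP15_good_pops_snps_file_*; do
        [ -e "$f" ] && n=$(( n + 1 ))
    done
    echo "$n"
}

# When an external command must receive the names, let find split them into
# batches that fit the system limit (reported by `getconf ARG_MAX`).
count_with_find() {
    find "$1" -maxdepth 1 -type f -name 'wa_filtering_DP15_good_pops_snps_file_*' \
        -exec sh -c 'echo "$#"' sh {} + | awk '{ s += $1 } END { print s + 0 }'
}
```

In count_with_find each batch prints how many names it received, and awk adds the batch sizes up, so the total is the same however many batches find needed.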
Probably not efficient enough for 11 million files but it's a different approach on the substitution. Takes one argument on the command line; the name of the directory where all of the files are stored. The name for the directory could be hard coded instead (see notes in code). The base name for the file is already hard coded without the number at the end (not required).
#!/bin/bash
# compare two rows in a file
# when both are 0, change both to 9
# otherwise keep original value
ProgName=${0##*/}
Pid=$$
DBG_FNAME=""
scriptUsage() {
cat <<ENDUSE
$ProgName </path/to/directory> [ [-d|--debug] || [-f|--filename] ]
path/to/directory: Path to directory (NO trailing '/')
-f|--filename: Print the each file name to stdout after complete
-d|--debug: Run in debug mode (Implies filename option - SEE NOTE*)
-h|--help: Print this help message
NOTE: USING [-d|--debug] AUTOMATICALLY SETS [-f|--filename]
You DO NOT need both together!
ENDUSE
}
# check args
#!# NOTE: you can delete from here to #!!# above 'WorkDir="$1"'
[[ -z $1 ]] && { >&2 echo "MISSING file source directory!"; scriptUsage; exit 1; }
[[ $1 == "-h" || $1 == "--help" ]] && { scriptUsage; exit 0; }
[[ -d $1 ]] || { >&2 echo "Unable to locate directory [$1]"; exit 1; }
if (( $# > 2 ))
then
DBG_FNAME=1
>&2 echo "Running in debug mode from using ${2} & ${3} together!"
echo "PID is: $Pid"
sleep 2
set -x
else
[[ $2 == "-f" || $2 == "--filename" ]] && DBG_FNAME=1
[[ $2 == "-d" || $2 == "--debug" ]] && { echo "PID is: $Pid"; set -x; }
fi
#!!# to here #!!#
# directory as arg[1] or change to hardcoded
WorkDir="$1"
# check for/remove trailing slash
[[ ${WorkDir:(-1)} == / ]] && WorkDir=${WorkDir:0:((${#WorkDir}-1))}
# given file root withOUT number ending
WorkFile="${WorkDir}/wa_filtering_DP15_good_pops_snps_file_"
##== MAIN LOOP
for file in ${WorkFile}*
do
# reset these after each file
TopRow=""
BotRow=""
NewTop=""
NewBot=""
SKIPME=""
# get top row of file
TopRow=$(sed -n '1{p;q}' "$file")
# get bottom row of file
BotRow=$(sed -n '2{p;q}' "$file")
##-- EACH FILE LOOP
for (( f=0; f<${#TopRow}; f++ ))
do
if [[ -n $SKIPME ]]
then
# SKIPME is -z by default so
# this runs every other time through
NewTop="${NewTop} "
NewBot="${NewBot} "
SKIPME=""
elif (( $((${TopRow:${f}:1}+${BotRow:${f}:1})) == 0 ))
then
# 0+0=0 so change to 9
NewTop="${NewTop}9"
NewBot="${NewBot}9"
SKIPME=1
else
# (1+0 or 0+1)!=0 so keep originals
NewTop="${NewTop}${TopRow:${f}:1}"
NewBot="${NewBot}${BotRow:${f}:1}"
SKIPME=1
fi
done
##--
# overwrite original file
printf "%s\n%s\n" "$NewTop" "$NewBot" > "$file"
# if -f|--filename given print file name
[[ -n $DBG_FNAME ]] && echo "$file is complete"
done
##==
This DOES EDIT FILES IN PLACE. It wouldn't be hard to have it make backups as it runs. It returns files exactly the way requested above.
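One way the backup step might look is sketched below. The function name overwrite_with_backup and the .bak suffix are assumptions for illustration, not part of the script above.

```shell
#!/bin/sh
# Hypothetical helper: keep a copy of the original next to it (suffix ".bak",
# an assumption) and only then overwrite the file with the two new rows.
# Usage: overwrite_with_backup FILE "new top row" "new bottom row"
overwrite_with_backup() {
    cp -p -- "$1" "$1.bak" || return 1     # stop if the backup cannot be made
    printf '%s\n%s\n' "$2" "$3" > "$1"
}
```

Keep in mind that a backup per file doubles the number of files in the directory, which matters if inodes are already scarce, as noted in the comments on the question.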
awk answer (getline variant): edited Sep 28 '17 at 16:49, answered Sep 20 '17 at 18:11 by αғsнιη
Ruby answer: edited Sep 20 '17 at 19:28, answered Sep 20 '17 at 18:37 by glenn jackman
You may not want theFile.linkstep if you're running out of inodes.
– glenn jackman
Sep 20 '17 at 19:34
add a comment |
You may not want theFile.linkstep if you're running out of inodes.
– glenn jackman
Sep 20 '17 at 19:34
You may not want the
File.link step if you're running out of inodes.– glenn jackman
Sep 20 '17 at 19:34
You may not want the
File.link step if you're running out of inodes.– glenn jackman
Sep 20 '17 at 19:34
add a comment |
This is an alternative approach, which might be slow for million of files compared to pure awk solutions.
Using something like this, you can transpose rows to columns:
$ cat file1
1 0 0 0 0 0 0 0 0 0 1 2 1
0 0 0 0 0 0 0 0 0 0 0 0 0
$ paste -d'-' <(head -n1 file1 |tr -s ' ' 'n') <(tail -n1 file1 |tr -s ' ' 'n')
1-0
0-0
0-0
0-0
0-0
0-0
0-0
0-0
0-0
0-0
1-0
2-0
1-0
You can then replace all 0-0 occurences with 9-9 with a simple sed, and you can store the output to a temp variable:
$ f1=$(sed 's/0-0/9-9/g' <(paste -d'-' <(head -n1 file1|tr -s ' ' 'n') <(tail -n1 file1 |tr -s ' ' 'n')))
$ echo "$f1"
1-0
9-9
9-9
9-9
9-9
9-9
9-9
9-9
9-9
9-9
1-0
2-0
1-0
You can now convert the columns back to rows like this:
$ awk -F'-' 'NR==FNR{printf "%s ",$1;p=1;next}p{printf "\n";p=0}{printf "%s ",$2}END{printf "\n"}' <(echo "$f1") <(echo "$f1")
1 9 9 9 9 9 9 9 9 9 1 2 1
0 9 9 9 9 9 9 9 9 9 0 0 0
You can also append >file1 to the end of the last awk command to overwrite file1 with the new contents.
The only thing left is to loop over all the files, which can be done with a bash loop:
for f in ./wa_filtering_DP15_good_pops_snps_file_*; do
    f1=$(sed 's/0-0/9-9/g' <(paste -d'-' <(head -n1 "$f" | tr -s ' ' '\n') <(tail -n1 "$f" | tr -s ' ' '\n')))
    awk -F'-' 'NR==FNR{printf "%s ",$1;p=1;next}p{printf "\n";p=0}{printf "%s ",$2}END{printf "\n"}' <(echo "$f1") <(echo "$f1") #>"$f" # uncomment >"$f" to overwrite the files
done
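One caveat on the sed step: the pattern 0-0 is not anchored, so it would also match inside a pair such as 10-0. The sample data only contains single digits, so this makes no difference there. The sketch below assumes that multi-digit values could occur (an assumption, not something the question states) and compares the unanchored form with one anchored to the whole line. Both function names are illustrative.

```shell
# Unanchored, as in the answer: matches "0-0" anywhere in the line.
sub_unanchored() { sed 's/0-0/9-9/g'; }

# Anchored: only a line that is exactly "0-0" is rewritten.
sub_anchored() { sed 's/^0-0$/9-9/'; }
```

For example, `printf '10-0\n0-0\n' | sub_anchored` leaves the 10-0 pair alone and rewrites only the 0-0 line, whereas the unanchored form would also alter the first line.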
answered Sep 20 '17 at 20:12
George Vasiliou
With awk:
NR == 1 {   # save the values from 1st line in array t
    split($0, t, FS);
}
NR == 2 {   # compare values from second line with those stored in array t
    for ( i = 1; i <= NF; ++i ) {
        # build l1 and l2 (line 1 and line 2) based on comparison
        if ($i == 0 && t[i] == 0) {
            l1 = (i == 1 ? 9 : l1 OFS 9 );
            l2 = (i == 1 ? 9 : l2 OFS 9 );
        } else {
            l1 = (i == 1 ? t[i] : l1 OFS t[i] );
            l2 = (i == 1 ? $i : l2 OFS $i );
        }
    }
}
END {   # output the two constructed lines
    print l1;
    print l2;
}
Running it on the example file:
$ awk -f script.awk file
1 9 9 9 9 9 9 9 9 9 1 2 1
0 9 9 9 9 9 9 9 9 9 0 0 0
Running it on all files matching wa_filtering_DP15_good_pops_snps_file_* in the current directory:
mkdir modified
for name in wa_filtering_DP15_good_pops_snps_file_*; do
awk -f script.awk "$name" >"modified/$name.new"
done
This will create a new file for each input file, with the name of the original file and an extra .new suffix. The new files will be placed in the modified folder in the current directory.
- I opted for creating new files so that the originals are left unmodified.
- I opted to put the new files in a new directory, as having 22 million files in a single directory could make the filesystem awkward to work with.
In general, try not to create millions of files in a single directory. Instead, either
- create many subdirectories and distribute the files in them, maybe based on a binning algorithm working on that last integer of the filename, or a hash, or
- create a single output file that aggregates all data, possibly with extra lines of text identifying what the following two lines refer to.
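The first suggestion (binning on the last integer of the filename) could look roughly like the sketch below. The helper name bin_for, the bin_NNN directory naming and the choice of 1000 bins are all assumptions made for illustration; it also assumes every filename ends in an underscore followed by digits only.

```shell
# Hypothetical helper: map a filename ending in "_<integer>" to one of
# 1000 bin directories, using the integer modulo 1000.
bin_for() {
    n=${1##*_}                                # text after the last "_"
    n=$(printf '%s\n' "$n" | sed 's/^0*//')   # drop leading zeros (avoids octal parsing)
    printf 'bin_%03d\n' "$(( ${n:-0} % 1000 ))"
}

# Usage sketch (commented out; assumes script.awk and the input files
# are in the current directory):
# for name in wa_filtering_DP15_good_pops_snps_file_*; do
#     dir=modified/$(bin_for "$name")
#     mkdir -p "$dir" && awk -f script.awk "$name" > "$dir/$name.new"
# done
```

With 1000 bins, 11 million files would come to roughly 11,000 files per directory if the trailing integers are evenly spread.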
The following variant will run more efficiently on millions of files:
FNR == 1 {  # save the values from 1st line in array t
    split($0, t, FS);
}
FNR == 2 {  # compare values from second line with those stored in array t
    for ( i = 1; i <= NF; ++i ) {
        # build l1 and l2 (line 1 and line 2) based on comparison
        if ($i == 0 && t[i] == 0) {
            l1 = (i == 1 ? 9 : l1 OFS 9 );
            l2 = (i == 1 ? 9 : l2 OFS 9 );
        } else {
            l1 = (i == 1 ? t[i] : l1 OFS t[i] );
            l2 = (i == 1 ? $i : l2 OFS $i );
        }
    }
    # create output filename based on input filename
    # and output the two lines
    f = "modified/" FILENAME ".new";
    print l1 >f;
    print l2 >f;
    # close the output so awk does not run out of open file descriptors
    close(f);
}
To run it:
mkdir modified
find . -maxdepth 1 -type f -name 'wa_filtering_DP15_good_pops_snps_file_*' \
    -exec awk -f script.awk {} +
The new files will be generated in the modified folder as before, but this time far fewer awk processes will be started (each one handles a large batch of files), so processing will be considerably faster.
I simplified your for loop body to this: ` up = t[i]; low = $i; if (up == 0 && low == 0) { up = 9; low = 9; } if(i != 1) { up = OFS up; low = OFS low; } l1 = l1 up; l2 = l2 low; `. Clearer and cleaner, in my opinion. Plus, one if test less.
– MiniMax
Sep 21 '17 at 12:57
@MiniMax That may be a good modification, I agree. I will fix it as soon as I have spare time on my hands (busy ATM).
– Kusalananda♦
Sep 21 '17 at 13:00
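For reference, the loop body suggested in the comment above can be written out as a complete script along these lines. This is a sketch rather than the answer's own code: the wrapper name fix_pairs is hypothetical, the result is printed to standard output instead of a file, and the reset of l1 and l2 at the top of the FNR == 2 block is an addition that this appending form needs when several files are processed in one run.

```shell
# Hypothetical wrapper around the simplified loop body from the comment.
fix_pairs() {
    awk '
    FNR == 1 { split($0, t, FS) }      # remember the first row of each file
    FNR == 2 {
        l1 = l2 = ""                   # start fresh for every input file
        for (i = 1; i <= NF; ++i) {
            up = t[i]; low = $i
            if (up == 0 && low == 0) { up = 9; low = 9 }
            if (i != 1) { up = OFS up; low = OFS low }
            l1 = l1 up; l2 = l2 low
        }
        print l1; print l2
    }' "$@"
}
```

Calling `fix_pairs file` on the example file should print the same two rows as the original script.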
edited Sep 21 '17 at 11:35
answered Sep 20 '17 at 18:23
Kusalananda♦
First variant:
For a single file:
datamash -W transpose < input.txt | sed 's/0\t0/9\t9/' | datamash transpose
For many files, do the same in a loop:
for i in *; do datamash -W transpose < "$i" |
    sed 's/0\t0/9\t9/' |
    datamash transpose > "new_$i"; done
This loop creates a new, changed file for each input file, with the prefix "new_" added to its name. You can then remove all the old files and strip the "new_" prefix from the filenames.
Second variant:
This is a solution for a single file; for multiple files, use a loop as in the previous variant.
tr '\n' '\t' < input.txt |
awk '{
    num = NF / 2;
    for(up = 1; up <= NF; up++) {
        if(up <= num) {
            low = num + up;
            if(!$up && !$low) {
                $up = 9;
                $low = 9;
            }
        }
        printf "%s\t", $up;
        if(up % num == 0)
            print "";
    }
}'
Explanation
tr '\n' '\t' < input.txt joins the two lines into one.
awk
- checks one element from the first line together with the corresponding element from the second line: 1 and 316, 2 and 317, 3 and 318, and so on.
- if both elements are 0, it changes them to 9.
- prints the fields in order: 1, 2, 3, 4 ... 628, 629, 630.
- each time the field number is a multiple of the number of elements per line, it adds a newline.
Input
1 0 0 0 0 0 0 0 0 0 1 2 1
0 0 0 0 0 0 0 0 0 0 0 0 0
Output
1 9 9 9 9 9 9 9 9 9 1 2 1
0 9 9 9 9 9 9 9 9 9 0 0 0
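The index pairing described in the explanation (field up in the first half of the joined line is paired with field num + up in the second half) can be isolated in a small function. This is a hedged variation, not the answer's exact code: the name fix_joined is made up, and unlike the original it separates fields with single spaces and prints no trailing tab.

```shell
# Hypothetical variation of the second variant: join both rows into one
# record, fix the pairs, then print the record as two rows again.
fix_joined() {
    tr '\n' '\t' < "$1" |
    awk '{
        num = NF / 2                       # fields per original row
        for (up = 1; up <= num; up++) {
            low = num + up                 # same column in the second row
            if (!$up && !$low) { $up = 9; $low = 9 }
        }
        for (i = 1; i <= NF; i++)          # break the line after each row
            printf "%s%s", $i, (i % num == 0 ? "\n" : " ")
    }'
}
```

For many files, `fix_joined "$i" > "new_$i"` could take the place of the pipeline inside the loop from the first variant.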
'for i in *' is probably not a valid solution for 11 million files. Use xargs instead.
– RonJohn
Sep 21 '17 at 0:59
@RonJohn * will only be a problem if the shell uses it to execute an external command. for i in * will not be a problem on 11 million files in itself.
– Kusalananda♦
Sep 21 '17 at 6:15
But the '*' will expand in the bash command buffer, certainly overflowing it.
– RonJohn
Sep 22 '17 at 15:40
@RonJohn Here is the answer to your question.
– MiniMax
Sep 22 '17 at 17:22
edited Sep 21 '17 at 21:10
answered Sep 21 '17 at 0:02
MiniMax
Probably not efficient enough for 11 million files, but it's a different approach to the substitution. The script takes one argument on the command line: the name of the directory where all of the files are stored. The directory name could be hard-coded instead (see the notes in the code). The base name of the files is already hard-coded, without the number at the end (which is not required).
#!/bin/bash
# compare two rows in a file
# when both are 0, change both to 9
# otherwise keep original value
# NOTE: the rows are walked character by character, so this assumes
# single-digit values separated by single spaces
ProgName=${0##*/}
Pid=$$
DBG_FNAME=""
scriptUsage() {
cat <<ENDUSE
$ProgName </path/to/directory> [ [-d|--debug] || [-f|--filename] ]
    path/to/directory: Path to directory (NO trailing '/')
    -f|--filename: Print each file name to stdout once it is complete
    -d|--debug: Run in debug mode (Implies filename option - SEE NOTE*)
    -h|--help: Print this help message
NOTE: USING [-d|--debug] AUTOMATICALLY SETS [-f|--filename]
      You DO NOT need both together!
ENDUSE
}
# check args
#!# NOTE: you can delete from here to #!!# above 'WorkDir="$1"'
[[ -z $1 ]] && { >&2 echo "MISSING file source directory!"; scriptUsage; exit 1; }
[[ $1 == "-h" || $1 == "--help" ]] && { scriptUsage; exit 0; }
[[ -d $1 ]] || { >&2 echo "Unable to locate directory [$1]"; exit 1; }
if (( $# > 2 ))
then
    DBG_FNAME=1
    >&2 echo "Running in debug mode from using ${2} & ${3} together!"
    echo "PID is: $Pid"
    sleep 2
    set -x
else
    [[ $2 == "-f" || $2 == "--filename" ]] && DBG_FNAME=1
    [[ $2 == "-d" || $2 == "--debug" ]] && { echo "PID is: $Pid"; set -x; }
fi
#!!# to here #!!#
# directory as arg[1] or change to hardcoded
WorkDir="$1"
# check for/remove trailing slash
[[ ${WorkDir:(-1)} == / ]] && WorkDir=${WorkDir:0:((${#WorkDir}-1))}
# given file root withOUT number ending
WorkFile="${WorkDir}/wa_filtering_DP15_good_pops_snps_file_"
##== MAIN LOOP
for file in "${WorkFile}"*
do
    # reset these after each file
    TopRow=""
    BotRow=""
    NewTop=""
    NewBot=""
    SKIPME=""
    # get top row of file
    TopRow=$(sed -n '1{p;q}' "$file")
    # get bottom row of file
    BotRow=$(sed -n '2{p;q}' "$file")
    ##-- EACH FILE LOOP
    for (( f=0; f<${#TopRow}; f++ ))
    do
        if [[ -n $SKIPME ]]
        then
            # SKIPME is -z by default so
            # this runs every other time through
            NewTop="${NewTop} "
            NewBot="${NewBot} "
            SKIPME=""
        elif (( $((${TopRow:${f}:1}+${BotRow:${f}:1})) == 0 ))
        then
            # 0+0=0 so change to 9
            NewTop="${NewTop}9"
            NewBot="${NewBot}9"
            SKIPME=1
        else
            # (1+0 or 0+1)!=0 so keep originals
            NewTop="${NewTop}${TopRow:${f}:1}"
            NewBot="${NewBot}${BotRow:${f}:1}"
            SKIPME=1
        fi
    done
    ##--
    # overwrite original file
    printf "%s\n%s\n" "$NewTop" "$NewBot" > "$file"
    # if -f|--filename given print file name
    [[ -n $DBG_FNAME ]] && echo "$file is complete"
done
##==
Note that this DOES EDIT FILES IN PLACE. It wouldn't be hard to have it make backups as it runs. It returns the files exactly the way requested above.
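As a rough idea of what making backups as it runs might look like, the overwrite step could be moved into a small function that copies the original first. This is only a sketch: the function name write_rows_with_backup and the .bak suffix are assumptions, not part of the script above.

```shell
# Hypothetical helper: copy FILE to FILE.bak, then overwrite FILE with
# the two rebuilt rows.
write_rows_with_backup() {
    file=$1 top=$2 bot=$3
    # keep the original (with its mode and timestamps) before touching it
    cp -p -- "$file" "$file.bak" || return 1
    printf '%s\n%s\n' "$top" "$bot" > "$file"
}
```

Inside the main loop, the printf line would then become `write_rows_with_backup "$file" "$NewTop" "$NewBot"`. Bear in mind that this doubles the number of files in the directory.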
NewTop="${NewTop}${TopRow:${f}:1}"
NewBot="${NewBot}${BotRow:${f}:1}"
SKIPME=1
fi
done
##--
# overwrite original file
printf "%sn%s" "$NewTop" "$NewBot" > $file
# if -f|--filename given print file name
[[ -n $DBG_FNAME ]] && echo "$file is complete"
done
##==
DOES EDIT FILES IN PLACE. Wouldn't be hard to have it make backups as it runs. Returns files exactly the way requested above.
edited 6 hours ago by Rui F Ribeiro
answered Oct 1 '17 at 2:47 by EnterUserNameHere
With millions of small files, you're at risk of running out of inodes. Check with `df /path/to/files` versus `df -i /path/to/files`.
– glenn jackman
Sep 20 '17 at 19:33
I have a suspicion you would be better off rearchitecting, perhaps just to set up a database, but there's not enough information here to diagnose the real situation. ;) Good luck.
– Wildcard
Sep 20 '17 at 21:08