Pick up successive lines containing keywords in order


I have a tab-separated file that looks as follows:

$ cat file
GCF_000015405.1_ASM1540v1.dist_nbr_anntn        WP_011558474.1  1159543 1160595 -4330977        polyketide synthase [Mycobacterium]
GCF_000015405.1_ASM1540v1.dist_nbr_anntn        WP_011558475.1  1160607 1161116 12      isoprenylcysteine carboxyl methyltransferase [Mycobacterium]
GCF_000015405.1_ASM1540v1.dist_nbr_anntn        WP_011558476.1  1161113 1162129 -3      NAD(P)/FAD-dependent oxidoreductase [Mycobacterium]
GCF_000015405.1_ASM1540v1.dist_nbr_anntn        WP_011559726.1  2496640 2497560 1334511 polyketide synthase [Mycobacterium]
GCF_000015405.1_ASM1540v1.dist_nbr_anntn        WP_011559727.1  2497568 2498122 8       isoprenylcysteine carboxyl methyltransferase [Mycobacterium]
GCF_000015405.1_ASM1540v1.dist_nbr_anntn        WP_011562574.1  5526997 5528142 3028875 NAD(P)/FAD-dependent oxidoreductase [Mycobacterium]

I need to pick up successive lines that contain the keywords 'polyketide synthase', 'methyltransferase', and 'oxidoreductase' in that order, and write each of these sets into separate files for further analysis.

In this case, the input file would yield 2 output files which would look as follows:

$ cat file_1
GCF_000015405.1_ASM1540v1.dist_nbr_anntn        WP_011558474.1  1159543 1160595 -4330977        polyketide synthase [Mycobacterium]
GCF_000015405.1_ASM1540v1.dist_nbr_anntn        WP_011558475.1  1160607 1161116 12      isoprenylcysteine carboxyl methyltransferase [Mycobacterium]
GCF_000015405.1_ASM1540v1.dist_nbr_anntn        WP_011558476.1  1161113 1162129 -3      NAD(P)/FAD-dependent oxidoreductase [Mycobacterium]

$ cat file_2
GCF_000015405.1_ASM1540v1.dist_nbr_anntn        WP_011559726.1  2496640 2497560 1334511 polyketide synthase [Mycobacterium]
GCF_000015405.1_ASM1540v1.dist_nbr_anntn        WP_011559727.1  2497568 2498122 8       isoprenylcysteine carboxyl methyltransferase [Mycobacterium]
GCF_000015405.1_ASM1540v1.dist_nbr_anntn        WP_011562574.1  5526997 5528142 3028875 NAD(P)/FAD-dependent oxidoreductase [Mycobacterium]

I am having a hard time doing this using awk. Any suggestions?

P.S. I have other files that contain a variable number of instances of the keywords in successive lines. This is where I am getting stuck.

edited 14 hours ago

asked 14 hours ago

BhushanDhamale


What makes those two output files, file_1 and file_2, different from each other? And which other files are you talking about, the ones that "contain variable number of instances of the keywords in successive lines"? Please edit your question and make it a little clearer.

– αғsнιη
13 hours ago

text-processing awk


2 Answers

You can change what you are searching for as the script progresses, and change where you write to each time you have cycled through all of your terms:

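awk 'BEGIN {
    result_file = 1                 # number of the current output file
    term_id = 1                     # index of the term expected next
    search_terms[1] = "polyketide synthase"
    search_terms[2] = "methyltransferase"
    search_terms[3] = "oxidoreductase"
}
# Act only on lines that match the term currently being searched for.
$0 ~ search_terms[term_id] {
    # Parenthesize the file name: an unparenthesized concatenation after
    # >> is not parsed the same way by every awk.
    print $0 >> ("file_" result_file)
    term_id = term_id + 1
    if (term_id > 3) {              # all three terms seen: start a new file
        close("file_" result_file)
        result_file = result_file + 1
        term_id = 1
    }
}' input_file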

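This will write to file_1, file_2, and so on. Because >> appends, delete any output files left over from an earlier run before running it again.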
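One thing to keep in mind with the cycling approach: it moves on to the next term whenever any later line matches, so on its own it does not require the matching lines to be adjacent. The following is only a minimal sketch of one way to add that requirement, not the code from the answer. It buffers a partial run and throws it away when a non-matching line interrupts it. The variable names (prefix, want, buf, out), the scratch directory created with mktemp -d, and the shortened sample data are all assumptions made for illustration.

```shell
#!/bin/sh
# Scratch directory so that no existing files are touched (assumption).
workdir=$(mktemp -d)

# Hypothetical tab-separated sample: a complete run (id1-id3), a run
# interrupted by an unrelated line (id4-id7), and another complete run.
printf '%s\t%s\n' \
    id1  'polyketide synthase [X]' \
    id2  'isoprenylcysteine carboxyl methyltransferase [X]' \
    id3  'NAD(P)/FAD-dependent oxidoreductase [X]' \
    id4  'polyketide synthase [X]' \
    id5  'hypothetical protein [X]' \
    id6  'methyltransferase [X]' \
    id7  'oxidoreductase [X]' \
    id8  'polyketide synthase [X]' \
    id9  'methyltransferase [X]' \
    id10 'oxidoreductase [X]' > "$workdir/input"

awk -v prefix="$workdir/file_" '
BEGIN {
    n = split("polyketide synthase,methyltransferase,oxidoreductase", terms, ",")
    want = 1    # index of the term expected on the current line
    out = 1     # number of the next output file
}
{
    if ($0 !~ terms[want]) {
        # The run is broken: drop the partial buffer, then test this
        # same line as a possible start of a new run.
        want = 1
        buf = ""
        if ($0 !~ terms[1]) next
    }
    buf = (want == 1 ? $0 : buf ORS $0)
    if (++want > n) {
        # All n terms matched on adjacent lines: write the buffered block.
        file = prefix out
        out++
        print buf > file
        close(file)
        want = 1
        buf = ""
    }
}' "$workdir/input"

ls "$workdir"
```

On this sample the run that starts at id4 is discarded because of the unrelated id5 line, so only file_1 (id1 to id3) and file_2 (id8 to id10) are written.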

answered 13 hours ago

Philip Couling


You might test the following code, where I split your keywords into an awk array named keys with N elements. Everything starts with keys[1]: when a line matches it, we set a flag and then check whether each of the next N-1 lines matches the corresponding element of keys (index 2 to N). Any mismatch before the last of those lines resets the flag. If the last line is reached, all N lines matched and the block is written out (we also reset flag=0 there, so a consecutive run of flag==1 never exceeds N-1 lines):

$ cat t24.awk
BEGIN {
    FS = OFS = "\t";
    keywords = "polyketide synthase,methyltransferase,oxidoreductase";
    N = split(keywords, keys, ",")
}

# flag==1 means we are regex-matching the next N-1 lines
# against the corresponding elements keys[2..N].
# As soon as a line fails to match, turn the flag off.
# If all N-1 lines match, print the collected block.
flag {
    if ($NF ~ keys[NR - start_line + 1]) {
        F = F ORS $0;
        if (NR == start_line+N-1) { print F > ("out_" f++); flag = 0 }
        next
    } else {
        flag = 0;
    }
}

# set up the flag/start_line and reset F
$NF ~ keys[1] { flag = 1; F = $0; start_line = NR; }

Run the above code with awk -f t24.awk file.txt. To make it more flexible, you can pass the comma-delimited keywords from your shell with -v keywords="..." instead of hard-coding them in the BEGIN block.
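Passing the keywords with -v could look roughly like the sketch below. It assumes the assignment in the BEGIN block is made conditional, because a -v value is set before BEGIN runs and an unconditional assignment there would overwrite it. The script name seq.awk, the prefix variable, the explicit f = 0, the close() call and the sample data are hypothetical additions for illustration and are not part of the answer.

```shell
#!/bin/sh
# Scratch directory so that no existing files are touched (assumption).
workdir=$(mktemp -d)

cat > "$workdir/seq.awk" <<'EOF'
BEGIN {
    FS = OFS = "\t"
    f = 0
    # Fall back to a default only when -v keywords=... was not given.
    if (keywords == "")
        keywords = "polyketide synthase,methyltransferase,oxidoreductase"
    N = split(keywords, keys, ",")
}
# While flag is set, line start_line+k must match keys[k+1].
flag {
    if ($NF ~ keys[NR - start_line + 1]) {
        F = F ORS $0
        if (NR == start_line + N - 1) {
            file = prefix f
            f++
            print F > file
            close(file)
            flag = 0
        }
        next
    }
    flag = 0
}
# A line matching the first keyword starts a new candidate run.
$NF ~ keys[1] { flag = 1; F = $0; start_line = NR }
EOF

# Hypothetical sample with the description in the last tab-separated field.
printf '%s\t%s\n' \
    id1 'methyltransferase [X]' \
    id2 'oxidoreductase [X]' \
    id3 'polyketide synthase [X]' \
    id4 'methyltransferase [X]' \
    id5 'oxidoreductase [X]' > "$workdir/input"

# Search for a two-keyword sequence chosen on the command line.
awk -v keywords='methyltransferase,oxidoreductase' \
    -v prefix="$workdir/out_" \
    -f "$workdir/seq.awk" "$workdir/input"

ls "$workdir"
```

With this two-keyword list the sample yields out_0 (id1, id2) and out_1 (id4, id5). As in the script above, a block is only written when a later line completes the run, so a list with a single keyword produces no output.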

answered 11 hours ago

jxc
