Large molecule dataset


I have been testing a machine learning approach for molecular energy prediction. My current dataset is QM9, which consists of molecules with up to 9 heavy atoms. I was wondering if anyone knows of the largest molecule datasets available. I will be testing ZINC, which has molecules with up to 38 atoms. Does anyone know of a dataset with larger molecules? Thanks!







quantum-chemistry computational-chemistry databases

– Blade
asked 9 hours ago

















  • In retrospect, after I posted an answer about molecules encoded as SMILES while «user1271772» posted another answer about molecules with stored geometries: does your request for a larger dataset refer to a dataset containing molecules with more than 38 atoms, or to a dataset with more molecules than ZINC?
    – Buttonwood
    5 hours ago












  • Larger than 38 atoms. I was going to say so, but I appreciated your detailed answer very much.
    – Blade
    5 hours ago


















2 Answers
The ISOL24 benchmark set, part of the GMTKN55 database (http://www.thch.uni-bonn.de/tc.old/downloads/GMTKN/GMTKN55/ISOL24.html), contains molecules with up to 81 atoms!
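To check the size of the molecules in a geometry-based set like this one, it may help to count atoms directly from the structure files. The following is only a minimal sketch, assuming the structures are available as standard XYZ files (atom count on the first line, a comment on the second, then one `symbol x y z` line per atom); the function name `parse_xyz` and the water example are hypothetical and not taken from the database itself.

```python
from typing import List, Tuple


def parse_xyz(text: str) -> Tuple[int, List[str]]:
    """Return (atom count, element symbols) for one XYZ-format structure.

    Assumes the plain XYZ layout: line 1 holds the number of atoms,
    line 2 is a free-form comment, and each following line starts with
    an element symbol followed by three coordinates.
    """
    lines = text.strip().splitlines()
    n_atoms = int(lines[0].split()[0])       # line 1: declared atom count
    atom_lines = lines[2:2 + n_atoms]        # skip the comment on line 2
    if len(atom_lines) != n_atoms:
        raise ValueError("fewer coordinate lines than the header declares")
    symbols = [line.split()[0] for line in atom_lines]
    return n_atoms, symbols


if __name__ == "__main__":
    # Hypothetical example structure (water), only to show the call.
    water = """3
water molecule
O  0.000  0.000  0.117
H  0.000  0.757 -0.471
H  0.000 -0.757 -0.471
"""
    count, symbols = parse_xyz(water)
    print(count, symbols)
```

Applying such a function to every structure file and taking the maximum count would show whether a set really exceeds the 38 atoms of ZINC.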







– user1271772
answered 6 hours ago (edited 1 hour ago)




























This sounds like work at least related to that of the Lilienfeld group, which hosts a dedicated site about the data sets used in their earlier and ongoing exploration of chemical space, the programs used to work with the data, and their publications.

To go considerably higher in molecule count than QM9, you could go for either

• GDB-11, covering small organic molecules of up to 11 atoms of C, N, O and F, which «contains 26.4 million molecules (110.9 million stereoisomers), including three- and four-membered rings and triple bonds», described in J. Chem. Inf. Model. 2007, 47, 342-353 (doi.org/10.1021/ci600423u), or

• GDB-13, covering «small organic molecules up to 13 atoms of C, N, O, S and Cl following simple chemical stability and synthetic feasibility rules. With 977 468 314 structures, GDB-13 is the largest publicly available small organic molecule database to date», described in J. Am. Chem. Soc. 2009, 131, 8732-8733 (doi.org/10.1021/ja902302h).

Conveniently, you can download both from the Reymond group, including subsets such as «containing only carbon and nitrogen», «chlorine and sulfur», or «fragrance like» in case you don't want to fetch 2 GB of already compressed data. To quote: «All the molecules are stored in dearomatized, canonized SMILES format.»

The even larger GDB-17 («of up to 17 atoms of C, N, O, S, and halogens», a universe of 166 billion entries, described in J. Chem. Inf. Model. 2012, 52, 2864-2875, doi.org/10.1021/ci300415d, open access) is publicly accessible only as a random subset of 50 million entries, partly because the gzipped archive is about 400 GB. Among the publications citing this work is, for example, one from the Lilienfeld group on machine learning (J. Chem. Phys. 143, 084111 (2015), doi.org/10.1063/1.4928757).
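Since these sets are distributed as SMILES, one quick way to see how large their molecules actually are is to count heavy atoms per string. The following is a rough sketch using only the Python standard library; it assumes one plain SMILES string per entry, does no chemical validation, and the helper names `heavy_atom_count` and `larger_than` are made up for illustration. A cheminformatics toolkit such as RDKit would give more reliable counts.

```python
import re
from typing import Iterable, Iterator

# One atom token in a SMILES string: a bracket atom such as [NH4+],
# a two-letter organic-subset halogen (Cl, Br), or a one-letter atom
# (aliphatic or aromatic). Bonds, ring-closure digits and parentheses
# are simply not matched and therefore skipped.
_ATOM = re.compile(r"\[[^\]]+\]|Br|Cl|[BCNOPSFI]|[bcnops]")

# Bracket atoms that are explicit hydrogens, e.g. [H], [H+], [2H].
# The lookahead keeps [He], [Hg], ... from being treated as hydrogen.
_HYDROGEN = re.compile(r"\[\d*H(?![a-z])")


def heavy_atom_count(smiles: str) -> int:
    """Count non-hydrogen atoms in a SMILES string (no validation)."""
    return sum(
        1 for m in _ATOM.finditer(smiles) if not _HYDROGEN.match(m.group())
    )


def larger_than(smiles_list: Iterable[str], threshold: int) -> Iterator[str]:
    """Yield the SMILES strings with more than `threshold` heavy atoms."""
    for smiles in smiles_list:
        if heavy_atom_count(smiles) > threshold:
            yield smiles


if __name__ == "__main__":
    # Hypothetical sample entries, only to show the calls.
    sample = ["CCO", "C1=CC=CC=C1", "ClCCBr", "C[C@@H](N)C(=O)O"]
    print(max(heavy_atom_count(s) for s in sample))
    print(list(larger_than(sample, 5)))
```

Running `larger_than` with a threshold of 38 over any of the GDB sets above should return nothing, because they are capped at 11, 13 and 17 atoms of the listed elements; they are larger than QM9 in molecule count, not in molecule size.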







– Buttonwood
answered 5 hours ago















• I don't think any of these have more than 81 atoms!
  – user1271772
  4 hours ago










• OK, I see the confusion. @Buttonwood, perhaps you can answer this question: chemistry.stackexchange.com/questions/119804/…
  – user1271772
  4 hours ago

















