Registration Dossier

Data platform availability banner - registered substances factsheets

Please be aware that this old REACH registration data factsheet is no longer maintained; it remains frozen as of 19th May 2023.

The new ECHA CHEM database has been released by ECHA, and it now contains all REACH registration data. More details on the transition of ECHA's published data to ECHA CHEM are available from ECHA.


Physical & Chemical properties

Water solubility


Administrative data

Link to relevant study record(s)


Endpoint:
water solubility
Type of information:
experimental study
Adequacy of study:
key study
Study period:
October 2018
Reliability:
1 (reliable without restriction)
Rationale for reliability incl. deficiencies:
guideline study
Qualifier:
according to guideline
Guideline:
OECD Guideline 105 (Water Solubility)
GLP compliance:
no
Other quality assurance:
other: ISO 9001 certification
Type of method:
flask method
Key result
Water solubility:
< 1 mg/L
Conc. based on:
test mat.
Incubation duration:
24 h
pH:
ca. 7.8
Remarks on result:
not determinable because of methodological limitations
Conclusions:
The water solubility at 20 ºC is < 1 mg/L.
Executive summary:

The water solubility at 20 ºC is < 1 mg/L.

Endpoint:
water solubility
Type of information:
(Q)SAR
Adequacy of study:
weight of evidence
Study period:
December 2018
Reliability:
2 (reliable with restrictions)
Rationale for reliability incl. deficiencies:
results derived from a valid (Q)SAR model, but not (completely) falling into its applicability domain, with adequate and reliable documentation / justification
Justification for type of information:
1. SOFTWARE:
Programs WSKOWWIN and WATERNT, included in EPI Suite (Estimation Programs Interface Suite™ for Microsoft® Windows, v 4.11)


2. MODEL (incl. version number)

WSKOWWIN v1.42 estimates the water solubility (WSol) of an organic compound using the compound’s log octanol-water partition coefficient (log Kow). WSKOWWIN requires only a chemical structure to estimate WSol.
The estimation methodology used by WSKOWWIN (Meylan and Howard, 1994a,b) is described in the following document prepared for the U.S. Environmental Protection Agency (OPPT): Upgrade of PCGEMS Water Solubility Estimation Method (May 1994). A companion document (Validation of Water Solubility Estimation Methods Using Log Kow for Application in PCGEMS & EPI) also discusses the methodology. A journal article that describes the methodology is also available (Meylan et al., 1996).

The Water Solubility Program (WATERNT v1.01) estimates the water solubility of organic compounds at 25ºC. WATERNT requires only a chemical structure to estimate a solubility. Structures are entered into WATERNT by SMILES (Simplified Molecular Input Line Entry System) notations.


3. SMILES OR OTHER IDENTIFIERS USED AS INPUT FOR THE MODEL

SMILES : O=C(CCCCCC)OCC(COC(=O)CCCCCC)(COC(=O)CCCCCC)COCC(COC(=O)CCCCCC)(COC(=O)CCCCCC)COC(=O)CCCCCC

4. SCIENTIFIC VALIDITY OF THE (Q)SAR MODEL

- Defined endpoint:
Water solubility of an organic compound, maximum amount of substance that can be dissolved in water (g/L)

- Unambiguous algorithm:
WSKOWWIN estimates the water solubility (WSol) of an organic compound using the compound’s log octanol-water partition coefficient (Kow). WSKOWWIN requires only a chemical structure to estimate Wsol.
The estimation methodology used by WSKOWWIN (Meylan and Howard, 1994a,b) is described in the following document prepared for the U.S. Environmental Protection Agency (OPPT): Upgrade of PCGEMS Water Solubility Estimation Method (May 1994). A companion document (Validation of Water Solubility Estimation Methods Using Log Kow for Application in PCGEMS & EPI) also discusses the methodology. A journal article that describes the methodology is also available (Meylan et al., 1996).

WSKOWWIN uses equations 19 and 20 from these documents because they are the best available equations for estimating Wsol:
Equation 19 is: log S (mol/L) = 0.796 - 0.854 log Kow - 0.00728 MW + Corrections
Equation 20 is: log S (mol/L) = 0.693 - 0.96 log Kow - 0.0092(Tm-25) - 0.00314 MW + Corrections
where MW is molecular weight and Tm is melting point (MP) in deg C (used only for solids). Corrections are applied to 15 structure types (e.g. alcohols, acids, selected phenols, nitros, amines, alkyl pyridines, amino acids, PAHs, multi-nitrogen types, etc.).

Equation 20 is used when a measured MP is available (as is the case for the substance under study, EC 945-883-1); otherwise, equation 19 is used. These equations were derived from a dataset of 1450 compounds with measured log Kow, water solubility, and MP. Equation 20 has the following statistical accuracy: correlation coefficient (r2) = 0.97, standard deviation = 0.409 log units, and absolute mean error = 0.313 log units. Application to a validation dataset of 817 compounds gave the following statistical accuracy: correlation coefficient (r2) = 0.902, standard deviation = 0.615 log units, and absolute mean error = 0.480 log units.
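The choice between equations 19 and 20 can be sketched in Python as follows. This is a minimal illustration, not the WSKOWWIN implementation: the function name, the argument names and the example inputs are hypothetical, and the handling of liquids (dropping the melting point term when Tm is at or below 25 deg C) is an assumption based on the remark that Tm is "used only for solids".

```python
def estimate_log_s(log_kow, mw, tm_c=None, corrections=0.0):
    """Estimate log S (mol/L) with equation 19 or equation 20.

    log_kow     -- log octanol-water partition coefficient
    mw          -- molecular weight (g/mol)
    tm_c        -- measured melting point in deg C, or None if unavailable
    corrections -- sum of the applicable correction factors
    """
    if tm_c is None:
        # Equation 19: no measured melting point available.
        return 0.796 - 0.854 * log_kow - 0.00728 * mw + corrections
    # Equation 20: measured melting point available.
    # Assumption: the melting point term is zero for liquids (Tm <= 25 deg C).
    melting_term = 0.0092 * max(tm_c - 25.0, 0.0)
    return 0.693 - 0.96 * log_kow - melting_term - 0.00314 * mw + corrections


# Hypothetical inputs, chosen only to exercise both branches.
print(estimate_log_s(log_kow=3.0, mw=200.0))             # equation 19
print(estimate_log_s(log_kow=3.0, mw=200.0, tm_c=75.0))  # equation 20
```

The result is a base-10 logarithm of a molar solubility, so it still has to be converted with the molecular weight before it can be compared with values in mg/L.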

WSKOWWIN estimates a log Kow for every SMILES notation by using the estimation engine from the KOWWIN Program (SRC, 2000). WSKOWWIN also automatically retrieves experimental log Kow values from a database containing more than 13200 organic compounds with reliably measured values. When a SMILES structure matches a database structure (via an exact atom-to-atom connection match), the experimental log Kow value is retrieved and used to predict Wsol rather than the estimated value.

WSKOWWIN v1.4 includes an experimental water solubility database of 6230 compounds. When experimental data are available for the SMILES being estimated, the data are retrieved and shown in the Results Window.
A complete description of the estimation methodology used by WSKOWWIN is available in two documents prepared for the U.S. Environmental Protection Agency, Office of Pollution Prevention and Toxics (Meylan and Howard, 1994a,b). A journal article that describes the methodology is also available (Meylan et al., 1996). The WSKOWWIN program estimates the water solubility of an organic compound using the compound's log octanol-water partition coefficient (log Kow). A brief description is given below.

Data Collection
A database of more than 8400 compounds with reliably measured log Kow values had already been compiled from available sources. Most experimental values were taken from a "star-list" compilation of Hansch and Leo (1985) that had already been critically evaluated (see also Hansch et al, 1995) or an extensive compilation by Sangster (1993) that includes many "recommended" values based upon critical evaluation. Other log Kow values were taken from sources located through the Environmental Fate Data Base (EFDB) system (Howard et al, 1982, 1986). A few values were taken from Section 4a, 8d, and 8e submissions to the U.S. EPA under the Toxic Substances Control Act (see http://www.syrres.com/esc/tscats_info.htm).

Water solubilities were collected from the AQUASOL dATAbASE™ of the University of Arizona (Yalkowsky and Dannenfelser, 1990), Syracuse Research Corporation's PHYSPROP© Database (SRC, 1994), and sources located through the Environmental Fate Data Base (EFDB) system (Howard et al, 1982, 1986). Water solubilities were primarily constrained to the 20-25 ºC temperature range, with 25 ºC being preferred.

Melting points were collected from sources such as the AQUASOL dATAbASE™, PHYSPROP©, and EFDB, as well as the Handbook of Chemistry and Physics (Lide, 1990) and the Aldrich Catalog (Aldrich, 1992).

Regression & Results

A dataset of 1450 compounds (941 solids, 509 liquids) having reliably measured water solubility, log Kow and melting point was used as the training set for developing the new estimation algorithms for water solubility.  Standard linear regressions were used to fit  water solubility (as log S) with log Kow, melting point and molecular weight.

Residual errors from the initial regression fit were examined for compounds sharing common structural features with relatively consistent errors.  On that basis, 12 compound classes were initially identified and added to the regression to comprise a multi-linear regression including log Kow, melting point and/or molecular weight plus 12 correction factors.  Each correction factor is counted a maximum of once per structure [if applicable], no matter how many times the applicable fragment occurs.  For example, the nitro factor in 1,4-dinitrobenzene is counted just once.  A compound either contains a correction factor or it doesn't; therefore, the matrix for the multi-linear regression contained either a 0 or 1 for each correction factor. Appendix E describes the correction factors and coefficients used by WSKOWWIN.
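The once-per-structure rule for correction factors can be illustrated with a short Python sketch. The factor names below are an illustrative subset of the structure types mentioned in the text, and the coefficient values are placeholders invented for the demonstration; the actual factors and coefficients are those of Appendix E, which is not reproduced here.

```python
# Illustrative subset of structure types; Appendix E holds the full list.
CORRECTION_FACTORS = ("alcohol", "acid", "nitro", "amine")


def indicator_row(feature_counts):
    """Return one 0/1 entry per correction factor, however often the feature occurs."""
    return [1 if feature_counts.get(name, 0) > 0 else 0 for name in CORRECTION_FACTORS]


def sum_corrections(feature_counts, coefficients):
    """Sum the coefficients of the factors that are present, each counted once."""
    row = indicator_row(feature_counts)
    return sum(coefficients[name] * flag for name, flag in zip(CORRECTION_FACTORS, row))


# 1,4-dinitrobenzene carries two nitro groups, but the nitro factor is flagged once.
dinitrobenzene = {"nitro": 2}
print(indicator_row(dinitrobenzene))

# Placeholder coefficients for demonstration only; NOT the Appendix E values.
placeholder_coefficients = {"alcohol": 0.5, "acid": 0.4, "nitro": -0.3, "amine": 0.6}
print(sum_corrections(dinitrobenzene, placeholder_coefficients))
```

Each row produced this way corresponds to one compound in the 0/1 matrix used for the multi-linear regression.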

WSKOWWIN estimates water solubility for any compound with one of two possible equations.  The equations are equations 19 and 20 from Meylan and Howard (1994a) or equations 11 and 12 from the journal article (Meylan et al., 1996).  The equations are:

     log S (mol/L)  =  0.796 - 0.854 log Kow - 0.00728 MW + ΣCorrections

    log S (mol/L)  =  0.693 - 0.96 log Kow - 0.0092(Tm-25) - 0.00314 MW + ΣCorrections

(where MW is molecular weight, Tm is melting point (MP) in deg C [used only for solids]) ... Summation of Corrections (ΣCorrections) are applied as described in Appendix E.   When a measured MP is available, that equation is used; otherwise, the equation with just MW is used.


The WATERNT program and estimation methodology were developed at Syracuse Research Corporation for the US Environmental Protection Agency and are described in the document Preliminary Report: Water Solubility Estimation by Base Compound Modification (Sept 1995). The estimation methodology is based upon a "fragment constant" method very similar to the method of the KOWWIN Program, which estimates octanol-water partition coefficients. A journal article by Meylan and Howard (1995) describes the KOWWIN program methodology.

In a "fragment constant" method, a structure is divided into fragments (atom or larger functional groups) and coefficient values of each fragment or group are summed together to yield the solubility estimate. We call WATERNT’s methodology the Atom/Fragment Contribution (AFC) method. Coefficients for individual fragments and groups in WATERNT were derived by multiple regression of 1000 reliably measured water solubility values.

To estimate water solubility, WATERNT initially separates a molecule into distinct atom/fragments. In general, each non-hydrogen atom (e.g. carbon, nitrogen, oxygen, sulfur, etc.) in a structure is a "core" for a fragment; the exact fragment is determined by what is connected to the atom. Several functional groups are treated as core "atoms"; these include carbonyl (C=O), thiocarbonyl (C=S), nitro (-NO2), nitrate (-ONO2), cyano (-C≡N), and isothiocyanate (-N=C=S). Connections to each core "atom" are either general or specific; specific connections take precedence over general connections. For example, aromatic carbon, aromatic oxygen and aromatic sulfur atoms have nothing but general connections; i.e., the fragment is the same no matter what is connected to the atom. In contrast, there are 5 aromatic nitrogen fragments: (a) in a five-member ring, (b) in a six-member ring, (c) if the nitrogen is an oxide-type {e.g. pyridine oxide}, (d) if the nitrogen has a fused ring location {e.g. indolizine}, and (e) if the nitrogen has a +5 valence {e.g. N-methyl pyridinium iodide}; since the oxide-type is most specific, it takes precedence over the other four. The aliphatic carbon atom is another example; it does not matter what is connected to -CH3, -CH2-, or -CH<; the fragment is the same. However, an aliphatic carbon with no hydrogens has two possible fragments: (a) if there are four single bonds with 3 or more carbon connections and (b) any other not meeting the first criterion.

It became apparent, for various types of structures, that water solubility estimates made from atom/fragment values alone could or needed to be improved by inclusion of substructures larger or more complex than "atoms"; hence, correction factors were added to the AFC method. The term "correction factor" is appropriate because their values are derived from the differences between the water solubility estimates from atoms alone and the measured water solubility values. The correction factors have two main groupings: first, factors involving aromatic ring substituent positions and second, miscellaneous factors. In general, the correction factors are values for various steric interactions, hydrogen-bondings, and effects from polar functional substructures. Individual correction factors were selected through a tedious process of correlating the differences (between solubility estimates from atom/fragments alone and measured solubility values) with common substructures.

Results of two successive multiple regressions (first for atom/fragments and second for correction factors) yield the following general equation for estimating water solubility of any organic compound:

log WatSol (moles/L) = Σ(fi * ni) + Σ(cj * nj) + 0.24922
(n = 1128, correlation coef (r2) = 0.940, standard deviation = 0.537, avg deviation = 0.355)
where Σ(fi * ni) is the summation of fi (the coefficient for each atom/fragment) times ni (the number of times the atom/fragment occurs in the structure) and Σ(cj * nj) is the summation of cj (the coefficient for each correction factor) times nj (the number of times the correction factor is applied in the molecule).
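The general equation above amounts to two weighted sums plus a constant. The sketch below shows that arithmetic in Python; the function name is hypothetical and the coefficient values in the example are placeholders, not WATERNT fragment or correction values (those are listed in Appendix D of the program documentation).

```python
INTERCEPT = 0.24922  # constant term of the WATERNT equation


def log_watsol(fragments, corrections=()):
    """Return log WatSol (moles/L) from (coefficient, count) pairs.

    fragments   -- iterable of (f_i, n_i) pairs for atom/fragments
    corrections -- iterable of (c_j, n_j) pairs for correction factors
    """
    fragment_sum = sum(f * n for f, n in fragments)
    correction_sum = sum(c * n for c, n in corrections)
    return fragment_sum + correction_sum + INTERCEPT


# Placeholder values for illustration only.
example_fragments = [(-0.5, 4), (0.3, 1)]
example_corrections = [(0.1, 1)]
print(log_watsol(example_fragments, example_corrections))
```

Unlike the WSKOWWIN correction factors, the counts n_i and n_j here can exceed one, which is why each entry carries its own count.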

Additionally, WATERNT automatically retrieves experimental water solubility values from a database containing more than 6200 organic compounds with measured values. When a SMILES structure matches a database structure (via an exact atom-to-atom connection match), the experimental water solubility value is retrieved and shown in the Results Window.


 - Defined domain of applicability:
(see point 5 below)

- Appropriate measures of goodness-of-fit and robustness and predictivity:
WSKOWWIN
The regression equations used by the WSKOWWIN program were trained with a dataset of 1450 compounds. WSKOWWIN estimates water solubility with one of two possible equations. When an experimental melting point is available, WSKOWWIN applies the equation containing both the melting point and the molecular weight (MW) parameters. In the absence of a melting point, the equation containing just the molecular weight is used to make the estimate. All compounds in the 1450 compound training set have known melting points or are known to be liquids at 25 ºC. The accuracy statistics for the two equations are as follows:

                 Melt Pt + MW    MW only
r2               0.970           0.934
std deviation    0.409           0.585
avg deviation    0.313           0.442

Validation

The WSKOWWIN estimation equations were initially validated on two datasets of compounds that were not included in the model training.  A relatively small dataset was tested that consisted of 85 compounds having experimental log Kow values, but no available melting points.  Many compounds in the 85 compound test set decompose before melting and would theoretically have very high melting points (e.g. amino acids and compounds having multiple nitrogens).  The accuracy statistics for the equation used by WSKOWWIN are:

number 85
r2 0.865
std deviation 0.961
avg deviation 0.714
 
A much larger dataset of 817 compounds was also tested.  All 817 compounds had experimental melting points, but none of the 817 compounds had a reliable experimental log Kow.  The log Kow values used for the validation-testing were estimated (primarily using the KOWWIN program available at that time); therefore, the water solubility estimates are based on estimates for log Kow.  Typically, estimates based on estimates reduce estimation accuracy, but this type of validation can provide insight into the ability of the method.  The accuracy statistics for this dataset are:

number 817
r2 0.902
std deviation 0.615
avg deviation 0.480
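For readers who want to reproduce statistics of this kind from paired experimental and estimated log S values, a minimal Python sketch follows. The definitions are assumptions, since the dossier does not spell them out: r2 is taken as the squared Pearson correlation, "std deviation" as the root mean square of the residuals, and "avg deviation" as the mean absolute residual. The data in the example are invented.

```python
import math


def accuracy_stats(experimental, estimated):
    """Return (r2, std_deviation, avg_deviation) for paired log S values."""
    n = len(experimental)
    mean_exp = sum(experimental) / n
    mean_est = sum(estimated) / n

    # Squared Pearson correlation coefficient.
    cov = sum((x - mean_exp) * (y - mean_est) for x, y in zip(experimental, estimated))
    var_exp = sum((x - mean_exp) ** 2 for x in experimental)
    var_est = sum((y - mean_est) ** 2 for y in estimated)
    r2 = cov ** 2 / (var_exp * var_est)

    # Residual-based measures, in log units.
    residuals = [y - x for x, y in zip(experimental, estimated)]
    std_deviation = math.sqrt(sum(r * r for r in residuals) / n)
    avg_deviation = sum(abs(r) for r in residuals) / n
    return r2, std_deviation, avg_deviation


# Invented values, used only to show the call.
r2, std_dev, avg_dev = accuracy_stats([-1.0, -2.0, -3.0, -4.0], [-1.5, -2.0, -2.5, -4.0])
print(round(r2, 3), round(std_dev, 3), round(avg_dev, 3))
```

If the original reports used a different divisor for the standard deviation, for example n minus the number of fitted parameters, the numbers would differ slightly from this sketch.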
  
Availability of Training & Validation Datasets

The complete datasets used to train and validate the SAR equations used by the WSKOWWIN program are available in two documents prepared for the U.S. Environmental Protection Agency, Office of Pollution Prevention and Toxics (Meylan and Howard, 1994a,b).  These documents, which also detail the estimation methodology, can be downloaded from the Internet at:

http://esc.syrres.com/interkow/EpiSuiteData.htm

Substructure searchable formats of the data can be downloaded at:
http://esc.syrres.com/interkow/EpiSuiteData_ISIS_SDF.htm


WATERNT
Training Set Data
The estimation accuracy for the current WATERNT Program training set is as follows:
n= 1128
r2= 0.940
std dev= 0.537
avg dev= 0.355

Validation Data Sets
Currently, WATERNT has been tested on a validation dataset of 4,636 compounds not included in the training set.  These 4636 compounds were collected from the PHYSPROP Database.  Various compounds having experimental water solubility values in PHYSPROP were excluded from the validation set (these included the majority of compounds that were inorganics, had measurements outside a temperature range of 10 to 40 degrees C, or were measured at specific pH values that might skew an estimation comparison).  The complete training and validation data sets are available as noted below.
 
The validation set includes a diverse selection of chemical structures that rigorously test the predictive accuracy of any model.  It contains many chemicals that are similar in structure to chemicals in the training set, but also many chemicals that are different from and structurally more complex than chemicals in the training set.  Statistical performance for estimated vs experimental log WatSol (moles/L) are as follows:
n= 4636
r2= 0.815
std dev= 1.045
avg dev= 0.796

Availability of Training and Validation Datasets

The complete training and validation data sets can be downloaded from the Internet at:
http://esc.syrres.com/interkow/EpiSuiteData.htm

Substructure searchable formats of the data can be downloaded at:
http://esc.syrres.com/interkow/EpiSuiteData_ISIS_SDF.htm


- Mechanistic interpretation:


5. APPLICABILITY DOMAIN
WSKOWWIN
Appendix E gives the number of compounds in the 1450 compound training set containing each of the correction factors. The WSKOWWIN program applies an individual correction factor only once per structure [if at all] regardless of how many instances of the applicable structural feature occur in the structure. The minimum number of instances is zero and the maximum is one.

 

Range of water solubilities in the Training set:
Minimum  =  4 x 10^-7 mg/L (octachlorodibenzo-p-dioxin)
Maximum =  completely soluble (various)

Range of Molecular Weights in the Training set:
Minimum  =  27.03 (hydrocyanic acid)
Maximum =  627.62 (hexabromobiphenyl)

Range of Log Kow values in the Training set:
Minimum  =  -3.89 (aspartic acid)
Maximum =  8.27 (decachlorobiphenyl)

Currently there is no universally accepted definition of model domain.  However, users may wish to consider the possibility that water solubility estimates are less accurate for compounds outside the MW range, water solubility range and log Kow range of the training set compounds.  It is also possible that a compound may have a functional group(s) or other structural features not represented in the training set, and for which no correction factor was developed.  These points should be taken into consideration when interpreting model results.

WATERNT
Appendix D lists (for each fragment and correction factor) the maximum number of instances of that fragment in any of the 1128 training set compounds (the minimum number of instances is of course zero, since not all compounds had every fragment).  The minimum and maximum values for molecular weight are the following:

Training Set Molecular Weights:
Minimum MW:   30.03 (formaldehyde)
Maximum MW:  627.62 (hexabromobiphenyl)
Average MW:     187.73


Training Set Water Solubility Ranges:
Minimum Solubility (mg/L):   0.0000004  (octachlorodibenzo-p-dioxin)
Minimum Solubility (log moles/L):  -12.0605  (octachlorodibenzo-p-dioxin)
Maximum Solubility (mg/L):  miscible  (various)
Maximum Solubility (log moles/L):  1.3561  (acetaldehyde)
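The two units used in this list are related through the molecular weight: mg/L = 10^(log moles/L) x MW x 1000. The Python sketch below applies that conversion to the minimum value as a consistency check. The function name is hypothetical, and the molecular weight of octachlorodibenzo-p-dioxin (about 459.75 g/mol, from its formula C12Cl8O2) is a standard value that is not stated in the dossier.

```python
def log_molar_to_mg_per_l(log_s_mol_per_l, mw_g_per_mol):
    """Convert a log10 molar solubility (mol/L) to a mass solubility in mg/L."""
    mol_per_l = 10.0 ** log_s_mol_per_l
    return mol_per_l * mw_g_per_mol * 1000.0  # g/L -> mg/L


# Assumption: MW of octachlorodibenzo-p-dioxin, not given in the dossier.
OCDD_MW = 459.75

ocdd_mg_per_l = log_molar_to_mg_per_l(-12.0605, OCDD_MW)
print(f"{ocdd_mg_per_l:.2e} mg/L")
```

The converted value comes out close to the 0.0000004 mg/L listed above, so the two minimum entries are consistent with each other. The same conversion applies to the log S values produced by the WSKOWWIN and WATERNT equations.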
 
Currently there is no universally accepted definition of model domain.  However, users may wish to consider the possibility that water solubility estimates are less accurate for compounds outside the MW range of the training set compounds, and/or that have more instances of a given fragment than the maximum for all training set compounds.  It is also possible that a compound may have a functional group(s) or other structural features not represented in the training set, and for which no fragment coefficient was developed.  These points should be taken into consideration when interpreting model results.
 
6. ADEQUACY OF THE RESULT
The molecular weight of the substance is above the molecular weight range of the training sets of both models. However, the calculated values agree with the expectation of very low solubility, as observed experimentally, and are considered valid for supporting the limit solubility value obtained in the laboratory.
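The comparison against the training-set ranges can be sketched as a simple range check. The ranges below are the WSKOWWIN values listed in section 5 (the WATERNT maximum molecular weight is the same, 627.62). The function name is hypothetical, and the molecular weight of about 927.3 g/mol is an assumption computed from the SMILES in section 3 (formula C52H94O13); it is not stated in the dossier.

```python
# WSKOWWIN training-set ranges from section 5 (minimum, maximum).
WSKOWWIN_TRAINING_RANGES = {
    "mw": (27.03, 627.62),
    "log_kow": (-3.89, 8.27),
}


def descriptors_outside_domain(descriptors, ranges=WSKOWWIN_TRAINING_RANGES):
    """Return the names of the supplied descriptors that fall outside the training ranges."""
    outside = []
    for name, value in descriptors.items():
        low, high = ranges[name]
        if not (low <= value <= high):
            outside.append(name)
    return outside


# Assumption: MW derived from the SMILES, not quoted from the dossier.
ASSUMED_MW = 927.3
print(descriptors_outside_domain({"mw": ASSUMED_MW}))
```

A non-empty result only flags that the estimate is an extrapolation; as noted in section 5, there is no universally accepted definition of model domain.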
Reason / purpose for cross-reference:
other: QPRF constituent #1
Guideline:
other:
Version / remarks:
REACH Guidance on QSARs R.6
Principles of method if other than guideline:
WSKOWWIN:
Meylan, W.M. and P.H. Howard. 1994a. Upgrade of PCGEMS Water Solubility Estimation Method (May 1994 Draft). Prepared for Robert S. Boethling, U.S. Environmental Protection Agency, Office of Pollution Prevention and Toxics, Washington, DC; prepared by Syracuse Research Corporation, Environmental Science Center, Syracuse, NY 13210.

WATERNT:
Meylan, W.M. and P.H. Howard. 1995. Atom/fragment contribution method for estimating octanol-water partition coefficients. J. Pharm. Sci. 84: 83-92.
GLP compliance:
no
Specific details on test material used for the study:
SMILES : O=C(CCCCCC)OCC(COC(=O)CCCCCC)(COC(=O)CCCCCC)COCC(COC(=O)CCCCCC)(COC(=O)CCCCCC)COC(=O)CCCCCC
Water solubility:
< 0.00001 mg/L
Conc. based on:
test mat.
Remarks on result:
other: QSAR predicted value

WSKOWWIN predicted that the constituent DPE777777 has a water solubility of 4.25E-012 mg/L

WATERNT predicted that the constituent DPE777777 has a water solubility of 9.27E-007 mg/L

Conclusions:
Water solubility of DPE777777 is predicted to be <0.00001 mg/L
Endpoint:
water solubility
Type of information:
(Q)SAR
Adequacy of study:
weight of evidence
Study period:
December 2018
Reliability:
2 (reliable with restrictions)
Justification for type of information:
1. SOFTWARE:
Programs WSKOWWIN and WATERNT, included in EPI Suite (Estimation Programs Interface Suite™ for Microsoft® Windows, v 4.11)


2. MODEL (incl. version number)

WSKOWWIN v1.42 estimates the water solubility (WSol) of an organic compound using the compound’s log octanol-water partition coefficient (log Kow). WSKOWWIN requires only a chemical structure to estimate WSol.
The estimation methodology used by WSKOWWIN (Meylan and Howard, 1994a,b) is described in the following document prepared for the U.S. Environmental Protection Agency (OPPT): Upgrade of PCGEMS Water Solubility Estimation Method (May 1994). A companion document (Validation of Water Solubility Estimation Methods Using Log Kow for Application in PCGEMS & EPI) also discusses the methodology. A journal article that describes the methodology is also available (Meylan et al., 1996).

The Water Solubility Program (WATERNT v1.01) estimates the water solubility of organic compounds at 25ºC. WATERNT requires only a chemical structure to estimate a solubility. Structures are entered into WATERNT by SMILES (Simplified Molecular Input Line Entry System) notations.


3. SMILES OR OTHER IDENTIFIERS USED AS INPUT FOR THE MODEL

SMILES : O=C(CCCCCC)OCC(COC(=O)CCCCCC)(COC(=O)CC(C)CC(C)(C)C)COCC(COC(=O)CCCCCC)(COC(=O)CCCCCC)COC(=O)CCCCCC
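This SMILES differs from the one used for constituent DPE777777 in one acyl chain, which is branched here. A rough way to see the effect on formula and molecular weight is sketched below in Python. The helper is a deliberately naive counter, not a SMILES parser: it assumes an acyclic structure containing only C, H and O with implicit hydrogens, which holds for both strings. The variable and function names are hypothetical, and the atomic masses are standard rounded values.

```python
ATOMIC_MASS = {"C": 12.011, "H": 1.008, "O": 15.999}

DPE777777 = ("O=C(CCCCCC)OCC(COC(=O)CCCCCC)(COC(=O)CCCCCC)"
             "COCC(COC(=O)CCCCCC)(COC(=O)CCCCCC)COC(=O)CCCCCC")
SECOND_CONSTITUENT = ("O=C(CCCCCC)OCC(COC(=O)CCCCCC)(COC(=O)CC(C)CC(C)(C)C)"
                      "COCC(COC(=O)CCCCCC)(COC(=O)CCCCCC)COC(=O)CCCCCC")


def naive_formula(smiles):
    """Count C, H and O for an acyclic C/H/O SMILES with implicit hydrogens."""
    if not set(smiles) <= set("CO=()"):
        raise ValueError("only acyclic C/H/O SMILES without brackets are supported")
    carbons = smiles.count("C")
    oxygens = smiles.count("O")
    double_bonds = smiles.count("=")
    # Acyclic CcHhOo: h = 2c + 2, minus 2 per double bond.
    hydrogens = 2 * carbons + 2 - 2 * double_bonds
    return {"C": carbons, "H": hydrogens, "O": oxygens}


def molecular_weight(formula):
    """Sum atomic masses over an element-count mapping."""
    return sum(ATOMIC_MASS[element] * count for element, count in formula.items())


for label, smiles in (("DPE777777", DPE777777), ("second constituent", SECOND_CONSTITUENT)):
    formula = naive_formula(smiles)
    print(label, formula, round(molecular_weight(formula), 1))
```

Both molecular weights computed this way lie above the 627.62 maximum reported for the training sets, in line with the adequacy remark made for the first constituent.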

4. SCIENTIFIC VALIDITY OF THE (Q)SAR MODEL

- Defined endpoint:
Water solubility of an organic compound, maximum amount of substance that can be dissolved in water (g/L)

- Unambiguous algorithm:
WSKOWWIN estimates the water solubility (WSol) of an organic compound using the compound’s log octanol-water partition coefficient (Kow). WSKOWWIN requires only a chemical structure to estimate Wsol.
The estimation methodology used by WSKOWWIN (Meylan and Howard, 1994a,b) is described in the following document prepared for the U.S. Environmental Protection Agency (OPPT): Upgrade of PCGEMS Water Solubility Estimation Method (May 1994). A companion document (Validation of Water Solubility Estimation Methods Using Log Kow for Application in PCGEMS & EPI) also discusses the methodology. A journal article that describes the methodology is also available (Meylan et al., 1996).

WSKOWWIN uses equations 19 and 20 from these documents because they are the best available equations for estimating Wsol:
Equation 19 is: log S (mol/L) = 0.796 - 0.854 log Kow - 0.00728 MW + Corrections
Equation 20 is: log S (mol/L) = 0.693 - 0.96 log Kow - 0.0092(Tm-25) - 0.00314 MW + Corrections
where MW is molecular weight and Tm is melting point (MP) in deg C (used only for solids). Corrections are applied to 15 structure types (e.g. alcohols, acids, selected phenols, nitros, amines, alkyl pyridines, amino acids, PAHs, multi-nitrogen types, etc.).

Equation 20 is used when a measured MP is available (as is the case for the substance under study, EC 945-883-1); otherwise, equation 19 is used. These equations were derived from a dataset of 1450 compounds with measured log Kow, water solubility, and MP. Equation 20 has the following statistical accuracy: correlation coefficient (r2) = 0.97, standard deviation = 0.409 log units, and absolute mean error = 0.313 log units. Application to a validation dataset of 817 compounds gave the following statistical accuracy: correlation coefficient (r2) = 0.902, standard deviation = 0.615 log units, and absolute mean error = 0.480 log units.

WSKOWWIN estimates a log Kow for every SMILES notation by using the estimation engine from the KOWWIN Program (SRC, 2000). WSKOWWIN also automatically retrieves experimental log Kow values from a database containing more than 13200 organic compounds with reliably measured values. When a SMILES structure matches a database structure (via an exact atom-to-atom connection match), the experimental log Kow value is retrieved and used to predict Wsol rather than the estimated value.

WSKOWWIN v1.4 includes an experimental water solubility database of 6230 compounds. When experimental data are available for the SMILES being estimated, the data are retrieved and shown in the Results Window.
A complete description of the estimation methodology used by WSKOWWIN is available in two documents prepared for the U.S. Environmental Protection Agency, Office of Pollution Prevention and Toxics (Meylan and Howard, 1994a,b). A journal article that describes the methodology is also available (Meylan et al., 1996). The WSKOWWIN program estimates the water solubility of an organic compound using the compound's log octanol-water partition coefficient (log Kow). A brief description is given below.

Data Collection
A database of more than 8400 compounds with reliably measured log Kow values had already been compiled from available sources. Most experimental values were taken from a "star-list" compilation of Hansch and Leo (1985) that had already been critically evaluated (see also Hansch et al, 1995) or an extensive compilation by Sangster (1993) that includes many "recommended" values based upon critical evaluation. Other log Kow values were taken from sources located through the Environmental Fate Data Base (EFDB) system (Howard et al, 1982, 1986). A few values were taken from Section 4a, 8d, and 8e submissions to the U.S. EPA under the Toxic Substances Control Act (see http://www.syrres.com/esc/tscats_info.htm).

Water solubilities were collected from the AQUASOL dATAbASE™ of the University of Arizona (Yalkowsky and Dannenfelser, 1990), Syracuse Research Corporation's PHYSPROP© Database (SRC, 1994), and sources located through the Environmental Fate Data Base (EFDB) system (Howard et al, 1982, 1986). Water solubilities were primarily constrained to the 20-25 ºC temperature range, with 25 ºC being preferred.

Melting points were collected from sources such as the AQUASOL dATAbASE™, PHYSPROP©, and EFDB, as well as the Handbook of Chemistry and Physics (Lide, 1990) and the Aldrich Catalog (Aldrich, 1992).

Regression & Results

A dataset of 1450 compounds (941 solids, 509 liquids) having reliably measured water solubility, log Kow and melting point was used as the training set for developing the new estimation algorithms for water solubility.  Standard linear regressions were used to fit  water solubility (as log S) with log Kow, melting point and molecular weight.

Residual errors from the initial regression fit were examined for compounds sharing common structural features with relatively consistent errors.  On that basis, 12 compound classes were initially identified and added to the regression to comprise a multi-linear regression including log Kow, melting point and/or molecular weight plus 12 correction factors.  Each correction factor is counted a maximum of once per structure [if applicable], no matter how many times the applicable fragment occurs.  For example, the nitro factor in 1,4-dinitrobenzene is counted just once.  A compound either contains a correction factor or it doesn't; therefore, the matrix for the multi-linear regression contained either a 0 or 1 for each correction factor. Appendix E describes the correction factors and coefficients used by WSKOWWIN.

WSKOWWIN estimates water solubility for any compound with one of two possible equations.  The equations are equations 19 and 20 from Meylan and Howard (1994a) or equations 11 and 12 from the journal article (Meylan et al., 1996).  The equations are:

     log S (mol/L)  =  0.796 - 0.854 log Kow - 0.00728 MW + ΣCorrections

    log S (mol/L)  =  0.693 - 0.96 log Kow - 0.0092(Tm-25) - 0.00314 MW + ΣCorrections

(where MW is molecular weight, Tm is melting point (MP) in deg C [used only for solids]) ... Summation of Corrections (ΣCorrections) are applied as described in Appendix E.   When a measured MP is available, that equation is used; otherwise, the equation with just MW is used.


The WATERNT program and estimation methodology were developed at Syracuse Research Corporation for the US Environmental Protection Agency and are described in the document Preliminary Report: Water Solubility Estimation by Base Compound Modification (Sept 1995). The estimation methodology is based upon a "fragment constant" method very similar to the method of the KOWWIN Program, which estimates octanol-water partition coefficients. A journal article by Meylan and Howard (1995) describes the KOWWIN program methodology.

In a "fragment constant" method, a structure is divided into fragments (atom or larger functional groups) and coefficient values of each fragment or group are summed together to yield the solubility estimate. We call WATERNT’s methodology the Atom/Fragment Contribution (AFC) method. Coefficients for individual fragments and groups in WATERNT were derived by multiple regression of 1000 reliably measured water solubility values.

To estimate water solubility, WATERNT initially separates a molecule into distinct atom/fragments. In general, each non-hydrogen atom (e.g. carbon, nitrogen, oxygen, sulfur, etc.) in a structure is a "core" for a fragment; the exact fragment is determined by what is connected to the atom. Several functional groups are treated as core "atoms"; these include carbonyl (C=O), thiocarbonyl (C=S), nitro (-NO2), nitrate (-ONO2), cyano (-C≡N), and isothiocyanate (-N=C=S). Connections to each core "atom" are either general or specific; specific connections take precedence over general connections. For example, aromatic carbon, aromatic oxygen and aromatic sulfur atoms have nothing but general connections; i.e., the fragment is the same no matter what is connected to the atom. In contrast, there are 5 aromatic nitrogen fragments: (a) in a five-member ring, (b) in a six-member ring, (c) if the nitrogen is an oxide-type {e.g. pyridine oxide}, (d) if the nitrogen has a fused ring location {e.g. indolizine}, and (e) if the nitrogen has a +5 valence {e.g. N-methyl pyridinium iodide}; since the oxide-type is most specific, it takes precedence over the other four. The aliphatic carbon atom is another example; it does not matter what is connected to -CH3, -CH2-, or -CH<; the fragment is the same. However, an aliphatic carbon with no hydrogens has two possible fragments: (a) if there are four single bonds with 3 or more carbon connections and (b) any other not meeting the first criterion.

It became apparent, for various types of structures, that water solubility estimates made from atom/fragment values alone could or needed to be improved by inclusion of substructures larger or more complex than "atoms"; hence, correction factors were added to the AFC method. The term "correction factor" is appropriate because their values are derived from the differences between the water solubility estimates from atoms alone and the measured water solubility values. The correction factors have two main groupings: first, factors involving aromatic ring substituent positions and second, miscellaneous factors. In general, the correction factors are values for various steric interactions, hydrogen-bondings, and effects from polar functional substructures. Individual correction factors were selected through a tedious process of correlating the differences (between solubility estimates from atom/fragments alone and measured solubility values) with common substructures.

Results of two successive multiple regressions (first for atom/fragments and second for correction factors) yield the following general equation for estimating water solubility of any organic compound:

log WatSol (moles/L) = Σ(fi * ni) + Σ(cj * nj) + 0.24922
(n = 1128, correlation coef (r2) = 0.940, standard deviation = 0.537, avg deviation = 0.355)
where Σ(fi * ni) is the summation of fi (the coefficient for each atom/fragment) times ni (the number of times the atom/fragment occurs in the structure) and Σ(cj * nj) is the summation of cj (the coefficient for each correction factor) times nj (the number of times the correction factor is applied in the molecule).
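
As an illustration only, the general equation above can be sketched in a few lines of Python. The fragment and correction-factor names and their coefficient values below are placeholders invented for the example; they are not WATERNT coefficients (those are listed in Appendix D of the program documentation), and the function name is hypothetical.

```python
from typing import Mapping

WATERNT_INTERCEPT = 0.24922  # constant term of the general equation


def log_watsol(
    fragment_counts: Mapping[str, int],
    fragment_coeffs: Mapping[str, float],
    correction_counts: Mapping[str, int],
    correction_coeffs: Mapping[str, float],
) -> float:
    """Return log WatSol (moles/L) = sum(fi * ni) + sum(cj * nj) + 0.24922."""
    fragment_sum = sum(
        fragment_coeffs[name] * n for name, n in fragment_counts.items()
    )
    correction_sum = sum(
        correction_coeffs[name] * n for name, n in correction_counts.items()
    )
    return fragment_sum + correction_sum + WATERNT_INTERCEPT


# Placeholder coefficients for illustration only (NOT WATERNT values).
example_fragment_coeffs = {"frag_A": -0.5, "frag_B": 1.0}
example_correction_coeffs = {"corr_X": 0.25}

estimate = log_watsol(
    fragment_counts={"frag_A": 2, "frag_B": 1},
    fragment_coeffs=example_fragment_coeffs,
    correction_counts={"corr_X": 1},
    correction_coeffs=example_correction_coeffs,
)
print(round(estimate, 5))
```

The sketch only reproduces the summation; splitting a SMILES structure into atom/fragments and deciding which correction factors apply is the part handled by the program itself and is not attempted here.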

Additionally, WATERNT automatically retrieves experimental water solubility values from a database containing more than 6200 organic compounds with measured values. When a SMILES structure matches a database structure (via an exact atom-to-atom connection match), the experimental water solubility value is retrieved and shown in the Results Window.


 - Defined domain of applicability:
(see point 5 below)

- Appropriate measures of goodness-of-fit and robustness and predictivity:
WSKOWWIN
The regression equations used by the WSKOWWIN program were trained with a dataset of 1450 compounds. WSKOWWIN estimates water solubility with one of two possible equations. When an experimental melting point is available, WSKOWWIN applies the equation containing both the melting point and the molecular weight (MW) parameters. In the absence of a melting point, the equation containing just the molecular weight is used to make the estimate. All compounds in the 1450 compound training set have known melting points or are known to be liquids at 25 °C. The accuracy statistics for the two equations are as follows:

                 Melt Pt + MW    MW only
r2               0.970           0.934
std deviation    0.409           0.585
avg deviation    0.313           0.442

Validation

The WSKOWWIN estimation equations were initially validated on two datasets of compounds that were not included in the model training.  A relatively small dataset was tested that consisted of 85 compounds having experimental log Kow values, but no available melting points.  Many compounds in the 85 compound test set decompose before melting and would theoretically have very high melting points (e.g. amino acids and compounds having multiple nitrogens).  The accuracy statistics for the equation used by WSKOWWIN are:

number 85
r2 0.865
std deviation 0.961
avg deviation 0.714
 
A much larger dataset of 817 compounds was also tested.  All 817 compounds had experimental melting points, but none of the 817 compounds had a reliable experimental log Kow.  The log Kow values used for the validation-testing were estimated (primarily using the KOWWIN program available at that time); therefore, the water solubility estimates are based on estimates for log Kow.  Typically, estimates based on estimates reduce estimation accuracy, but this type of validation can provide insight into the ability of the method.  The accuracy statistics for this dataset are:

number 817
r2 0.902
std deviation 0.615
avg deviation 0.480
  
Availability of Training & Validation Datasets

The complete datasets used to train and validate the SAR equations used by the WSKOWWIN program are available in two documents prepared for the U.S. Environmental Protection Agency, Office of Pollution Prevention and Toxics (Meylan and Howard, 1994a,b).  These documents, which also detail the estimation methodology, can be downloaded from the Internet at:

http://esc.syrres.com/interkow/EpiSuiteData.htm

Substructure searchable formats of the data can be downloaded at:
http://esc.syrres.com/interkow/EpiSuiteData_ISIS_SDF.htm


WATERNT
Training Set Data
The estimation accuracy of the current WATERNT Program training set is as follows:
n= 1128
r2= 0.940
std dev= 0.537
avg dev= 0.355

Validation Data Sets
Currently, WATERNT has been tested on a validation dataset of 4,636 compounds not included in the training set.  These 4636 compounds were collected from the PHYSPROP Database.  Various compounds having experimental water solubility values in PHYSPROP were excluded from the validation set (these included the majority of compounds that were inorganics, had measurements outside a temperature range of 10 to 40 degrees C, or were measured at specific pH values that might skew an estimation comparison).  The complete training and validation data sets are available as noted below.
 
The validation set includes a diverse selection of chemical structures that rigorously test the predictive accuracy of any model.  It contains many chemicals that are similar in structure to chemicals in the training set, but also many chemicals that are different from and structurally more complex than chemicals in the training set.  Statistical performance for estimated vs experimental log WatSol (moles/L) are as follows:
n= 4636
r2= 0.815
std dev= 1.045
avg dev= 0.796

Availability of Training and Validation Datasets

The complete training and validation data sets can be downloaded from the Internet at:
http://esc.syrres.com/interkow/EpiSuiteData.htm

Substructure searchable formats of the data can be downloaded at:
http://esc.syrres.com/interkow/EpiSuiteData_ISIS_SDF.htm


- Mechanistic interpretation:


5. APPLICABILITY DOMAIN
WSKOWWIN
Appendix E gives the number of compounds in the 1450 compound training set containing each of the correction factors. The WSKOWWIN program applies an individual correction factor only once per structure [if at all] regardless of how many instances of the applicable structural feature occur in the structure. The minimum number of instances is zero and the maximum is one.

 

Range of water solubilities in the Training set:
Minimum  =  4 x 10^-7 mg/L (octachlorodibenzo-p-dioxin)
Maximum =  completely soluble (various)

Range of Molecular Weights in the Training set:
Minimum  =  27.03 (hydrocyanic acid)
Maximum =  627.62 (hexabromobiphenyl)

Range of Log Kow values in the Training set:
Minimum  =  -3.89 (aspartic acid)
Maximum =  8.27 (decachlorobiphenyl)

Currently there is no universally accepted definition of model domain.  However, users may wish to consider the possibility that water solubility estimates are less accurate for compounds outside the MW range, water solubility range and log Kow range of the training set compounds.  It is also possible that a compound may have a functional group(s) or other structural features not represented in the training set, and for which no correction factor was developed.  These points should be taken into consideration when interpreting model results.
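
A simple way to apply this consideration is to compare a compound's descriptors with the training-set ranges listed above. The following is a minimal sketch under that assumption; the function and dictionary names are hypothetical, the input values in the example are arbitrary, and passing the check does not by itself establish that a compound is inside the model domain (structural features are not examined).

```python
# Training-set ranges quoted above for WSKOWWIN.
WSKOWWIN_TRAINING_RANGES = {
    "molecular_weight": (27.03, 627.62),  # hydrocyanic acid to hexabromobiphenyl
    "log_kow": (-3.89, 8.27),             # aspartic acid to decachlorobiphenyl
}


def outside_training_ranges(molecular_weight: float, log_kow: float) -> list:
    """Return the names of the descriptors that fall outside the training ranges."""
    values = {"molecular_weight": molecular_weight, "log_kow": log_kow}
    flagged = []
    for name, (lower, upper) in WSKOWWIN_TRAINING_RANGES.items():
        if not lower <= values[name] <= upper:
            flagged.append(name)
    return flagged


# Arbitrary example inputs (not the values of the registered substance).
print(outside_training_ranges(molecular_weight=700.0, log_kow=5.0))
```

An empty list means only that both descriptors lie within the quoted ranges; any flagged descriptor is a reason to treat the estimate with more caution.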

WATERNT
Appendix D lists (for each fragment and correction factor) the maximum number of instances of that fragment in any of the 1128 training set compounds (the minimum number of instances is of course zero, since not all compounds had every fragment).  The minimum and maximum values for molecular weight are the following:

Training Set Molecular Weights:
Minimum MW:   30.03 (formaldehyde)
Maximum MW:  627.62 (hexabromobiphenyl)
Average MW:     187.73


Training Set Water Solubility Ranges:
Minimum Solubility (mg/L):   0.0000004  (octachlorodibenzo-p-dioxin)
Minimum Solubility (log moles/L):  -12.0605  (octachlorodibenzo-p-dioxin)
Maximum Solubility (mg/L):  miscible  (various)
Maximum Solubility (log moles/L):  1.3561  (acetaldehyde)
 
Currently there is no universally accepted definition of model domain.  However, users may wish to consider the possibility that water solubility estimates are less accurate for compounds outside the MW range of the training set compounds, and/or that have more instances of a given fragment than the maximum for all training set compounds.  It is also possible that a compound may have a functional group(s) or other structural features not represented in the training set, and for which no fragment coefficient was developed.  These points should be taken into consideration when interpreting model results.
 
6. ADEQUACY OF THE RESULT
The molecular weight of the substance is above the molecular weight range of the training sets of both models. However, the calculated values agree with the expectation of very low solubility, as observed experimentally, and are considered valid for supporting the limit solubility value obtained in the laboratory.
Reason / purpose for cross-reference:
(Q)SAR model reporting (QMRF)
Reason / purpose for cross-reference:
other: QPRF constituent #2
Guideline:
other:
Version / remarks:
REACH Guidance on QSARs R.6
Principles of method if other than guideline:
WSKowWIN:
Meylan, W.M. and P.H. Howard.    1994a.    Upgrade of PCGEMS Water Solubility Estimation Method (May 1994 Draft).  prepared for Robert S. Boethling, U.S. Environmental Protection Agency, Office of Pollution Prevention and Toxics, Washington, DC;  prepared by Syracuse Research Corporation, Environmental Science Center, Syracuse, NY 13210.

WATERNT:
Meylan, W.M. and P.H. Howard, 1995 Atom/fragment contribution method for estimating octanol-water partition coefficients. J. Pharm. Sci. 84: 83-92
GLP compliance:
no
Specific details on test material used for the study:
SMILES : O=C(CCCCCC)OCC(COC(=O)CCCCCC)(COC(=O)CC(C)CC(C)(C)C)COCC(COC(=O)CCCCCC)(COC(=O)CCCCCC)COC(=O)CCCCCC
Water solubility:
< 0.00001 mg/L
Conc. based on:
test mat.
Remarks on result:
other: QSAR predicted value

WSKOWWIN predicted that the constituent DPE777779 has a water solubility of 6.13E-013 mg/L

WATERNT predicted that the constituent DPE777779 has a water solubility of 9.55E-007 mg/L

Conclusions:
Water solubility of DPE777779 is predicted to be <0.00001 mg/L
Endpoint:
water solubility
Type of information:
(Q)SAR
Adequacy of study:
weight of evidence
Study period:
December 2018
Reliability:
2 (reliable with restrictions)
Justification for type of information:
1. SOFTWARE:
WSKOWWIN and WATERNT programs included in EPI Suite (Estimation Programs Interface Suite™ for Microsoft® Windows, v 4.11)


2. MODEL (incl. version number)

WSKOWWIN v1.42 estimates the water solubility (WSol) of an organic compound using the compound’s log octanol-water partition coefficient (log Kow). WSKOWWIN requires only a chemical structure to estimate WSol.
The estimation methodology used by WSKOWWIN (Meylan and Howard, 1994a,b) is described in the following document prepared for the U.S. Environmental Protection Agency (OPPT): Upgrade of PCGEMS Water Solubility Estimation Method (May 1994). A companion document (Validation of Water Solubility Estimation Methods Using Log Kow for Application in PCGEMS & EPI) also discusses the methodology. A journal article that describes the methodology is also available (Meylan et al., 1996).

The Water Solubility Program (WATERNT v1.01) estimates the water solubility of organic compounds at 25ºC. WATERNT requires only a chemical structure to estimate a solubility. Structures are entered into WATERNT by SMILES (Simplified Molecular Input Line Entry System) notations.


3. SMILES OR OTHER IDENTIFIERS USED AS INPUT FOR THE MODEL

SMILES : O=C(CCCCCC)OCC(COC(=O)CCCCCC)(COC(=O)CC(C)CC(C)(C)C)COCC(COC(=O)CCCCCC)(COC(=O)CC(C)CC(C)(C)C)COC(=O)CCCCCC

4. SCIENTIFIC VALIDITY OF THE (Q)SAR MODEL

- Defined endpoint:
Water solubility of an organic compound, maximum amount of substance that can be dissolved in water (g/L)

- Unambiguous algorithm:
WSKOWWIN estimates the water solubility (WSol) of an organic compound using the compound’s log octanol-water partition coefficient (log Kow). WSKOWWIN requires only a chemical structure to estimate WSol.
The estimation methodology used by WSKOWWIN (Meylan and Howard, 1994a,b) is described in the following document prepared for the U.S. Environmental Protection Agency (OPPT): Upgrade of PCGEMS Water Solubility Estimation Method (May 1994). A companion document (Validation of Water Solubility Estimation Methods Using Log Kow for Application in PCGEMS & EPI) also discusses the methodology. A journal article that describes the methodology is also available (Meylan et al., 1996).

WSKOWWIN uses equations 19 and 20 from these documents because they are the best available equations for estimating Wsol:
Equation 19 is: log S (mol/L) = 0.796 - 0.854 log Kow - 0.00728 MW + Corrections
Equation 20 is: log S (mol/L) = 0.693 - 0.96 log Kow - 0.0092(Tm-25) - 0.00314 MW + Corrections
where MW is molecular weight and Tm is melting point (MP) in deg C (used only for solids). Corrections are applied to 15 structure types (e.g. alcohols, acids, selected phenols, nitros, amines, alkyl pyridines, amino acids, PAHs, multi-nitrogen types, etc.).

Equation 20 is used when a measured MP is available (as is the case for the substance under study, EC 945-883-1); otherwise, equation 19 is used. These equations were derived from a dataset consisting of 1450 compounds with measured log Kow, water solubility, and MP. Equation 20 has the following statistical accuracy: correlation coefficient (r2) = 0.97, standard deviation = 0.409 log units, and absolute mean error = 0.313 log units. Application to a validation dataset of 817 compounds gave the following statistical accuracy: correlation coefficient (r2) = 0.902, standard deviation = 0.615 log units, and absolute mean error = 0.480 log units.

WSKOWWIN estimates a log Kow for every SMILES notation by using the estimation engine from the KOWWIN Program (SRC, 2000). WSKOWWIN also automatically retrieves experimental log Kow values from a database containing more than 13200 organic compounds with reliably measured values. When a SMILES structure matches a database structure (via an exact atom-to-atom connection match), the experimental log Kow value is retrieved and used to predict Wsol rather than the estimated value.

WSKOWWIN v1.4 includes an experimental water solubility database of 6230 compounds. When experimental data are available for the SMILES being estimated, the data are retrieved and shown in the Results Window.
A complete description of the estimation methodology used by WSKOWWIN is available in two documents prepared for the U.S. Environmental Protection Agency, Office of Pollution Prevention and Toxics (Meylan and Howard, 1994a,b). A journal article that describes the methodology is also available (Meylan et al., 1996). The WSKOWWIN program estimates the water solubility of an organic compound using the compound's log octanol-water partition coefficient (log Kow). A brief description is given below.

Data Collection
A database of more than 8400 compounds with reliably measured log Kow values had already been compiled from available sources. Most experimental values were taken from a "star-list" compilation of Hansch and Leo (1985) that had already been critically evaluated (see also Hansch et al, 1995) or an extensive compilation by Sangster (1993) that includes many "recommended" values based upon critical evaluation. Other log Kow values were taken from sources located through the Environmental Fate Data Base (EFDB) system (Howard et al, 1982, 1986). A few values were taken from Section 4a, 8d, and 8e submissions to the U.S. EPA under the Toxic Substances Control Act (see http://www.syrres.com/esc/tscats_info.htm).

Water solubilities were collected from the AQUASOL dATAbASE™ of the University of Arizona (Yalkowsky and Dannenfelser, 1990), Syracuse Research Corporation's PHYSPROP© Database (SRC, 1994), and sources located through the Environmental Fate Data Base (EFDB) system (Howard et al, 1982, 1986). Water solubilities were primarily constrained to the 20-25 °C temperature range, with 25 °C being preferred.

Melting points were collected from sources such as the AQUASOL dATAbASE™, PHYSPROP©, and EFDB, as well as the Handbook of Chemistry and Physics (Lide, 1990) and the Aldrich Catalog (Aldrich, 1992).
Regression & Results

A dataset of 1450 compounds (941 solids, 509 liquids) having reliably measured water solubility, log Kow and melting point was used as the training set for developing the new estimation algorithms for water solubility.  Standard linear regressions were used to fit  water solubility (as log S) with log Kow, melting point and molecular weight.

Residual errors from the initial regression fit were examined for compounds sharing common structural features with relatively consistent errors.  On that basis, 12 compound classes were initially identified and added to the regression to comprise a multi-linear regression including log Kow, melting point and/or molecular weight plus 12 correction factors.  Each correction factor is counted a maximum of once per structure [if applicable], no matter how many times the applicable fragment occurs.  For example, the nitro factor in 1,4-dinitrobenzene is counted just once.  A compound either contains a correction factor or it doesn't; therefore, the matrix for the multi-linear regression contained either a 0 or 1 for each correction factor. Appendix E describes the correction factors and coefficients used by WSKOWWIN.
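
The once-per-structure rule described above amounts to turning feature counts into 0/1 indicators before summing. The sketch below illustrates this under stated assumptions: the factor names and coefficient values are placeholders made up for the example (the actual factors and coefficients are in Appendix E), and the function names are hypothetical.

```python
def correction_indicators(feature_counts: dict, known_factors) -> dict:
    """Map each known correction factor to 1 if the feature occurs at all, else 0."""
    return {
        name: 1 if feature_counts.get(name, 0) > 0 else 0
        for name in known_factors
    }


def sum_corrections(feature_counts: dict, coeffs: dict) -> float:
    """Sum the coefficients of the factors present, each counted at most once."""
    indicators = correction_indicators(feature_counts, coeffs)
    return sum(coeffs[name] * flag for name, flag in indicators.items())


# Placeholder coefficients for illustration only (NOT WSKOWWIN values).
placeholder_coeffs = {"nitro": -0.4, "alcohol": 0.5}

# 1,4-dinitrobenzene contains two nitro groups; the factor is still counted once.
dinitrobenzene_features = {"nitro": 2}
print(correction_indicators(dinitrobenzene_features, placeholder_coeffs))
print(sum_corrections(dinitrobenzene_features, placeholder_coeffs))
```

The indicator dictionary corresponds to one row of the 0/1 matrix used in the multi-linear regression.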

WSKOWWIN estimates water solubility for any compound with one of two possible equations.  The equations are equations 19 and 20 from Meylan and Howard (1994a) or equations 11 and 12 from the journal article (Meylan et al., 1996).  The equations are:

     log S (mol/L)  =  0.796 - 0.854 log Kow - 0.00728 MW + ΣCorrections

    log S (mol/L)  =  0.693 - 0.96 log Kow - 0.0092(Tm-25) - 0.00314 MW + ΣCorrections

where MW is molecular weight and Tm is melting point (MP) in deg C (used only for solids). The summation of corrections (ΣCorrections) is applied as described in Appendix E. When a measured MP is available, the equation containing Tm is used; otherwise, the equation with just MW is used.
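
For illustration, the two equations and the rule for choosing between them can be sketched as follows. This is not the WSKOWWIN code: the function names are hypothetical, the correction sum is passed in as a plain number, and treating a melting point at or below 25 deg C as a liquid (so that the melting point term drops out) is an assumption based on the remark that Tm is used only for solids. The conversion to mg/L assumes MW in g/mol.

```python
from typing import Optional


def wskowwin_log_s(
    log_kow: float,
    mw: float,
    melting_point_c: Optional[float] = None,
    corrections: float = 0.0,
) -> float:
    """Return log S (mol/L) from the MW-only or the melting point + MW equation."""
    if melting_point_c is None:
        # No measured melting point: equation with just MW.
        return 0.796 - 0.854 * log_kow - 0.00728 * mw + corrections
    # Assumption: the (Tm - 25) term applies only to solids (Tm above 25 deg C).
    melt_term = max(melting_point_c - 25.0, 0.0)
    return 0.693 - 0.96 * log_kow - 0.0092 * melt_term - 0.00314 * mw + corrections


def mol_per_l_to_mg_per_l(log_s: float, mw: float) -> float:
    """Convert log S (mol/L) to a solubility in mg/L using MW in g/mol."""
    return (10.0 ** log_s) * mw * 1000.0


# Arbitrary example inputs (not the values of the registered substance).
log_s_no_mp = wskowwin_log_s(log_kow=2.0, mw=100.0)
log_s_with_mp = wskowwin_log_s(log_kow=2.0, mw=100.0, melting_point_c=125.0)
print(log_s_no_mp, log_s_with_mp)
print(mol_per_l_to_mg_per_l(log_s_with_mp, mw=100.0))
```

Because both equations are linear in log Kow, an error of one log unit in an estimated log Kow shifts the predicted log S by roughly 0.85 to 0.96 log units, which is why estimates based on estimated log Kow values are less accurate.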


The WATERNT program and estimation methodology were developed at Syracuse Research Corporation for the US Environmental Protection Agency and are described in the document Preliminary Report: Water Solubility Estimation by Base Compound Modification (Sept 1995). The estimation methodology is based upon a "fragment constant" method very similar to the method of the KOWWIN Program, which estimates octanol-water partition coefficients. A journal article by Meylan and Howard (1995) describes the KOWWIN program methodology.

In a "fragment constant" method, a structure is divided into fragments (atom or larger functional groups) and coefficient values of each fragment or group are summed together to yield the solubility estimate. We call WATERNT’s methodology the Atom/Fragment Contribution (AFC) method. Coefficients for individual fragments and groups in WATERNT were derived by multiple regression of 1000 reliably measured water solubility values.

To estimate water solubility, WATERNT initially separates a molecule into distinct atom/fragments. In general, each non-hydrogen atom (e.g. carbon, nitrogen, oxygen, sulfur, etc.) in a structure is a "core" for a fragment; the exact fragment is determined by what is connected to the atom. Several functional groups are treated as core "atoms"; these include carbonyl (C=O), thiocarbonyl (C=S), nitro (-NO2), nitrate (ONO2), cyano (-C≡N), and isothiocyanate (-N=C=S). Connections to each core "atom" are either general or specific; specific connections take precedence over general connections. For example, aromatic carbon, aromatic oxygen and aromatic sulfur atoms have nothing but general connections; i.e., the fragment is the same no matter what is connected to the atom. In contrast, there are 5 aromatic nitrogen fragments: (a) in a five-member ring, (b) in a six-member ring, (c) if the nitrogen is an oxide-type {e.g. pyridine oxide}, (d) if the nitrogen has a fused ring location {e.g. indolizine}, and (e) if the nitrogen has a +5 valence {e.g. N-methyl pyridinium iodide}; since the oxide-type is most specific, it takes precedence over the other four. The aliphatic carbon atom is another example; it does not matter what is connected to -CH3, -CH2-, or -CH< , the fragment is the same; however, an aliphatic carbon with no hydrogens has two possible fragments: (a) if there are four single bonds with 3 or more carbon connections and (b) any other not meeting the first criterion.

It became apparent, for various types of structures, that water solubility estimates made from atom/fragment values alone could or needed to be improved by inclusion of substructures larger or more complex than "atoms"; hence, correction factors were added to the AFC method. The term "correction factor" is appropriate because their values are derived from the differences between the water solubility estimates from atoms alone and the measured water solubility values. The correction factors have two main groupings: first, factors involving aromatic ring substituent positions and second, miscellaneous factors. In general, the correction factors are values for various steric interactions, hydrogen-bondings, and effects from polar functional substructures. Individual correction factors were selected through a tedious process of correlating the differences (between solubility estimates from atom/fragments alone and measured solubility values) with common substructures.

Results of two successive multiple regressions (first for atom/fragments and second for correction factors) yield the following general equation for estimating water solubility of any organic compound:

log WatSol (moles/L) = Σ(fi * ni) + Σ(cj * nj) + 0.24922
(n = 1128, correlation coef (r2) = 0.940, standard deviation = 0.537, avg deviation = 0.355)
where Σ(fi * ni) is the summation of fi (the coefficient for each atom/fragment) times ni (the number of times the atom/fragment occurs in the structure) and Σ(cj * nj) is the summation of cj (the coefficient for each correction factor) times nj (the number of times the correction factor is applied in the molecule).

Additionally, WATERNT automatically retrieves experimental water solubility values from a database containing more than 6200 organic compounds with measured values. When a SMILES structure matches a database structure (via an exact atom-to-atom connection match), the experimental water solubility value is retrieved and shown in the Results Window.


 - Defined domain of applicability:
(see point 5 below)

- Appropriate measures of goodness-of-fit and robustness and predictivity:
WSKOWWIN
The regression equations used by the WSKOWWIN program were trained with a dataset of 1450 compounds. WSKOWWIN estimates water solubility with one of two possible equations. When an experimental melting point is available, WSKOWWIN applies the equation containing both the melting point and the molecular weight (MW) parameters. In the absence of a melting point, the equation containing just the molecular weight is used to make the estimate. All compounds in the 1450 compound training set have known melting points or are known to be liquids at 25 °C. The accuracy statistics for the two equations are as follows:

                 Melt Pt + MW    MW only
r2               0.970           0.934
std deviation    0.409           0.585
avg deviation    0.313           0.442

Validation

The WSKOWWIN estimation equations were initially validated on two datasets of compounds that were not included in the model training.  A relatively small dataset was tested that consisted of 85 compounds having experimental log Kow values, but no available melting points.  Many compounds in the 85 compound test set decompose before melting and would theoretically have very high melting points (e.g. amino acids and compounds having multiple nitrogens).  The accuracy statistics for the equation used by WSKOWWIN are:

number 85
r2 0.865
std deviation 0.961
avg deviation 0.714
 
A much larger dataset of 817 compounds was also tested.  All 817 compounds had experimental melting points, but none of the 817 compounds had a reliable experimental log Kow.  The log Kow values used for the validation-testing were estimated (primarily using the KOWWIN program available at that time); therefore, the water solubility estimates are based on estimates for log Kow.  Typically, estimates based on estimates reduce estimation accuracy, but this type of validation can provide insight into the ability of the method.  The accuracy statistics for this dataset are:

number 817
r2 0.902
std deviation 0.615
avg deviation 0.480
  
Availability of Training & Validation Datasets

The complete datasets used to train and validate the SAR equations used by the WSKOWWIN program are available in two documents prepared for the U.S. Environmental Protection Agency, Office of Pollution Prevention and Toxics (Meylan and Howard, 1994a,b).  These documents, which also detail the estimation methodology, can be downloaded from the Internet at:

http://esc.syrres.com/interkow/EpiSuiteData.htm

Substructure searchable formats of the data can be downloaded at:
http://esc.syrres.com/interkow/EpiSuiteData_ISIS_SDF.htm


WATERNT
Training Set Data
The estimation accuracy of the current WATERNT Program training set is as follows:
n= 1128
r2= 0.940
std dev= 0.537
avg dev= 0.355

Validation Data Sets
Currently, WATERNT has been tested on a validation dataset of 4,636 compounds not included in the training set.  These 4636 compounds were collected from the PHYSPROP Database.  Various compounds having experimental water solubility values in PHYSPROP were excluded from the validation set (these included the majority of compounds that were inorganics, had measurements outside a temperature range of 10 to 40 degrees C, or were measured at specific pH values that might skew an estimation comparison).  The complete training and validation data sets are available as noted below.
 
The validation set includes a diverse selection of chemical structures that rigorously test the predictive accuracy of any model.  It contains many chemicals that are similar in structure to chemicals in the training set, but also many chemicals that are different from and structurally more complex than chemicals in the training set.  Statistical performance for estimated vs experimental log WatSol (moles/L) are as follows:
n= 4636
r2= 0.815
std dev= 1.045
avg dev= 0.796

Availability of Training and Validation Datasets

The complete training and validation data sets can be downloaded from the Internet at:
http://esc.syrres.com/interkow/EpiSuiteData.htm

Substructure searchable formats of the data can be downloaded at:
http://esc.syrres.com/interkow/EpiSuiteData_ISIS_SDF.htm


- Mechanistic interpretation:


5. APPLICABILITY DOMAIN
WSKOWWIN
Appendix E gives the number of compounds in the 1450 compound training set containing each of the correction factors. The WSKOWWIN program applies an individual correction factor only once per structure [if at all] regardless of how many instances of the applicable structural feature occur in the structure. The minimum number of instances is zero and the maximum is one.

 

Range of water solubilities in the Training set:
Minimum  =  4 x 10^-7 mg/L (octachlorodibenzo-p-dioxin)
Maximum =  completely soluble (various)

Range of Molecular Weights in the Training set:
Minimum  =  27.03 (hydrocyanic acid)
Maximum =  627.62 (hexabromobiphenyl)

Range of Log Kow values in the Training set:
Minimum  =  -3.89 (aspartic acid)
Maximum =  8.27 (decachlorobiphenyl)

Currently there is no universally accepted definition of model domain.  However, users may wish to consider the possibility that water solubility estimates are less accurate for compounds outside the MW range, water solubility range and log Kow range of the training set compounds.  It is also possible that a compound may have a functional group(s) or other structural features not represented in the training set, and for which no correction factor was developed.  These points should be taken into consideration when interpreting model results.

WATERNT
Appendix D lists (for each fragment and correction factor) the maximum number of instances of that fragment in any of the 1128 training set compounds (the minimum number of instances is of course zero, since not all compounds had every fragment).  The minimum and maximum values for molecular weight are the following:

Training Set Molecular Weights:
Minimum MW:   30.03 (formaldehyde)
Maximum MW:  627.62 (hexabromobiphenyl)
Average MW:     187.73


Training Set Water Solubility Ranges:
Minimum Solubility (mg/L):   0.0000004  (octachlorodibenzo-p-dioxin)
Minimum Solubility (log moles/L):  -12.0605  (octachlorodibenzo-p-dioxin)
Maximum Solubility (mg/L):  miscible  (various)
Maximum Solubility (log moles/L):  1.3561  (acetaldehyde)
 
Currently there is no universally accepted definition of model domain.  However, users may wish to consider the possibility that water solubility estimates are less accurate for compounds outside the MW range of the training set compounds, and/or that have more instances of a given fragment than the maximum for all training set compounds.  It is also possible that a compound may have a functional group(s) or other structural features not represented in the training set, and for which no fragment coefficient was developed.  These points should be taken into consideration when interpreting model results.
 
6. ADEQUACY OF THE RESULT
The molecular weight of the substance is above the molecular weight range of the training sets of both models. However, the calculated values agree with the expectation of very low solubility, as observed experimentally, and are considered valid for supporting the limit solubility value obtained in the laboratory.
Reason / purpose for cross-reference:
(Q)SAR model reporting (QMRF)
Reason / purpose for cross-reference:
other: QPRF constituent #3
Guideline:
other:
Version / remarks:
REACH Guidance on QSARs R.6
Principles of method if other than guideline:
WSKowWIN:
Meylan, W.M. and P.H. Howard.    1994a.    Upgrade of PCGEMS Water Solubility Estimation Method (May 1994 Draft).  prepared for Robert S. Boethling, U.S. Environmental Protection Agency, Office of Pollution Prevention and Toxics, Washington, DC;  prepared by Syracuse Research Corporation, Environmental Science Center, Syracuse, NY 13210.

WATERNT:
Meylan, W.M. and P.H. Howard, 1995 Atom/fragment contribution method for estimating octanol-water partition coefficients. J. Pharm. Sci. 84: 83-92
GLP compliance:
no
Specific details on test material used for the study:
SMILES : O=C(CCCCCC)OCC(COC(=O)CCCCCC)(COC(=O)CC(C)CC(C)(C)C)COCC(COC(=O)CCCCCC)(COC(=O)CC(C)CC(C)(C)C)COC(=O)CCCCCC
Water solubility:
< 0.00001 mg/L
Conc. based on:
test mat.
Remarks on result:
other: QSAR predicted value

WSKOWWIN predicted that the constituent DPE777799 has a water solubility of 8.83E-014 mg/L

WATERNT predicted that the constituent DPE777799 has a water solubility of 9.83E-007 mg/L

Conclusions:
Water solubility of DPE777799 is predicted to be <0.00001 mg/L
Endpoint:
water solubility
Type of information:
(Q)SAR
Adequacy of study:
weight of evidence
Study period:
December 2018
Reliability:
2 (reliable with restrictions)
Justification for type of information:
1. SOFTWARE:
WSKOWWIN and WATERNT programs included in EPI Suite (Estimation Programs Interface Suite™ for Microsoft® Windows, v 4.11)


2. MODEL (incl. version number)

WSKowWin v1.42 estimates the water solubility (WSol) of an organic compound using the compound’s log octanol-water partition coefficient (Kow). WSKOWWIN requires only a chemical structure to estimate Wsol.
The estimation methodology used by WSKOWWIN (Meylan and Howard, 1994a,b) is described in the following document prepared for the U.S. Environmental Protection Agency (OPPT): Upgrade of PCGEMS Water Solubility Estimation Method (May 1994). A companion document (Validation of Water Solubility Estimation Methods Using Log Kow for Application in PCGEMS & EPI) also discusses the methodology. A journal article that describes the methodology is also available (Meylan et al., 1996).

The Water Solubility Program (WATERNT v1.01) estimates the water solubility of organic compounds at 25ºC. WATERNT requires only a chemical structure to estimate a solubility. Structures are entered into WATERNT by SMILES (Simplified Molecular Input Line Entry System) notations.


3. SMILES OR OTHER IDENTIFIERS USED AS INPUT FOR THE MODEL

SMILES : O=C(CCCCCC)OCC(COC(=O)CC(C)CC(C)(C)C)(COC(=O)CC(C)CC(C)(C)C)COCC(COC(=O)CCCCCC)(COC(=O)CC(C)CC(C)(C)C)COC(=O)CCCCCC

4. SCIENTIFIC VALIDITY OF THE (Q)SAR MODEL

- Defined endpoint:
Water solubility of an organic compound, maximum amount of substance that can be dissolved in water (g/L)

- Unambiguous algorithm:
WSKOWWIN estimates the water solubility (WSol) of an organic compound using the compound’s log octanol-water partition coefficient (Kow). WSKOWWIN requires only a chemical structure to estimate Wsol.
The estimation methodology used by WSKOWWIN (Meylan and Howard, 1994a,b) is described in the following document prepared for the U.S. Environmental Protection Agency (OPPT): Upgrade of PCGEMS Water Solubility Estimation Method (May 1994). A companion document (Validation of Water Solubility Estimation Methods Using Log Kow for Application in PCGEMS & EPI) also discusses the methodology. A journal article that describes the methodology is also available (Meylan et al., 1996).

WSKOWWIN uses equations 19 and 20 from these documents because they are the best available equations for estimating Wsol:
Equation 19 is: log S (mol/L) = 0.796 - 0.854 log Kow - 0.00728 MW + Corrections
Equation 20 is: log S (mol/L) = 0.693 - 0.96 log Kow - 0.0092(Tm-25) - 0.00314 MW + Corrections
where MW is molecular weight and Tm is the melting point (MP) in deg C (used only for solids). Corrections are applied to 15 structure types (e.g. alcohols, acids, selected phenols, nitros, amines, alkyl pyridines, amino acids, PAHs, multi-nitrogen types, etc.).

Equation 20 is used when a measured MP is available (as in the present case, EC 945-883-1); otherwise, equation 19 is used. These equations were derived from a dataset consisting of 1450 compounds with measured log Kow, water solubility, and MP. Equation 20 has the following statistical accuracy: correlation coefficient (r2) = 0.97, standard deviation = 0.409 log units, and absolute mean error = 0.313 log units. Application to a validation dataset of 817 compounds gave the following statistical accuracy: correlation coefficient (r2) = 0.902, standard deviation = 0.615 log units, and absolute mean error = 0.480 log units.
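
As an illustration of how equations 19 and 20 are selected and evaluated, a minimal Python sketch follows. It is not the WSKOWWIN implementation: the function names are hypothetical, the correction term is passed in as a single pre-summed number, and dropping the melting point term for substances that melt at or below 25 deg C is an assumption based on the statement that Tm is used only for solids.

```python
from typing import Optional


def wskowwin_log_s(log_kow: float, mw: float,
                   tm_c: Optional[float] = None,
                   corrections: float = 0.0) -> float:
    """Return log10 of the water solubility in mol/L.

    Equation 19 is used when no melting point is supplied,
    equation 20 when a measured melting point (deg C) is given.
    """
    if tm_c is None:
        # Equation 19: log Kow and MW only.
        return 0.796 - 0.854 * log_kow - 0.00728 * mw + corrections
    # Equation 20: the (Tm - 25) term is applied to solids only.
    # Assumption: the term is set to zero when Tm <= 25 deg C.
    tm_term = 0.0092 * (tm_c - 25.0) if tm_c > 25.0 else 0.0
    return 0.693 - 0.96 * log_kow - tm_term - 0.00314 * mw + corrections


def log_mol_per_l_to_mg_per_l(log_s: float, mw: float) -> float:
    """Convert log10(mol/L) to mg/L using the molecular weight (g/mol)."""
    return (10.0 ** log_s) * mw * 1000.0


if __name__ == "__main__":
    # Hypothetical inputs, chosen only to exercise both equations.
    print(wskowwin_log_s(log_kow=2.0, mw=100.0))
    print(wskowwin_log_s(log_kow=2.0, mw=100.0, tm_c=75.0))
    print(log_mol_per_l_to_mg_per_l(-3.0, 100.0))
```

The conversion helper is included because the equations return log S in mol/L, whereas the results in this record are reported in mg/L.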

WSKOWWIN estimates a log Kow for every SMILES notation by using the estimation engine from the KOWWIN Program (SRC, 2000). WSKOWWIN also automatically retrieves experimental log Kow values from a database containing more than 13200 organic compounds with reliably measured values. When a SMILES structure matches a database structure (via an exact atom-to-atom connection match), the experimental log Kow value is retrieved and used to predict Wsol rather than the estimated value.

WSKOWWIN v1.4 includes an experimental water solubility database of 6230 compounds. When experimental data are available for the SMILES being estimated, the data are retrieved and shown in the Results Window.
A complete description of the estimation methodology used by WSKOWWIN is available in two documents prepared for the U.S. Environmental Protection Agency, Office of Pollution Prevention and Toxics (Meylan and Howard, 1994a,b). A journal article that describes the methodology is also available (Meylan et al., 1996). The WSKOWWIN program estimates the water solubility of an organic compound using the compound’s log octanol-water partition coefficient (log Kow). A brief description is given below.

Data Collection
A database of more than 8400 compounds with reliably measured log Kow values had already been compiled from available sources. Most experimental values were taken from a "star-list" compilation of Hansch and Leo (1985) that had already been critically evaluated (see also Hansch et al, 1995) or an extensive compilation by Sangster (1993) that includes many "recommended" values based upon critical evaluation. Other log Kow values were taken from sources located through the Environmental Fate Data Base (EFDB) system (Howard et al, 1982, 1986). A few values were taken from Section 4a, 8d, and 8e submissions to the U.S. EPA under the Toxic Substances Control Act (see http://www.syrres.com/esc/tscats_info.htm).

Water solubilities were collected from the AQUASOL dATAbASE™ of the University of Arizona (Yalkowsky and Dannenfelser, 1990), Syracuse Research Corporation's PHYSPROP© Database (SRC, 1994), and sources located through the Environmental Fate Data Base (EFDB) system (Howard et al, 1982, 1986). Water solubilities were primarily constrained to the 20-25 °C temperature range, with 25 °C being preferred.

Melting points were collected from sources such as AQUASOL dATAbASE™, PHYSPROP©, and EFDB as well as the Handbook of Chemistry and Physics (Lide, 1990) and the Aldrich Catalog (Aldrich, 1992).
Regression & Results

A dataset of 1450 compounds (941 solids, 509 liquids) having reliably measured water solubility, log Kow and melting point was used as the training set for developing the new estimation algorithms for water solubility.  Standard linear regressions were used to fit  water solubility (as log S) with log Kow, melting point and molecular weight.

Residual errors from the initial regression fit were examined for compounds sharing common structural features with relatively consistent errors.  On that basis, 12 compound classes were initially identified and added to the regression to comprise a multi-linear regression including log Kow, melting point and/or molecular weight plus 12 correction factors.  Each correction factor is counted a maximum of once per structure [if applicable], no matter how many times the applicable fragment occurs.  For example, the nitro factor in 1,4-dinitrobenzene is counted just once.  A compound either contains a correction factor or it doesn't; therefore, the matrix for the multi-linear regression contained either a 0 or 1 for each correction factor. Appendix E describes the correction factors and coefficients used by WSKOWWIN.

WSKOWWIN estimates water solubility for any compound with one of two possible equations.  The equations are equations 19 and 20 from Meylan and Howard (1994a) or equations 11 and 12 from the journal article (Meylan et al., 1996).  The equations are:

     log S (mol/L)  =  0.796 - 0.854 log Kow - 0.00728 MW + ΣCorrections

    log S (mol/L)  =  0.693 - 0.96 log Kow - 0.0092(Tm-25) - 0.00314 MW + ΣCorrections

where MW is molecular weight and Tm is the melting point (MP) in deg C (used only for solids). The summed corrections (ΣCorrections) are applied as described in Appendix E. When a measured MP is available, the equation containing Tm is used; otherwise, the equation with just MW is used.


The WATERNT program and estimation methodology were developed at Syracuse Research Corporation for the US Environmental Protection Agency and are described in the document Preliminary Report: Water Solubility Estimation by Base Compound Modification (Sept 1995). The estimation methodology is based upon a "fragment constant" method very similar to the method of the KOWWIN Program, which estimates octanol-water partition coefficients. A journal article by Meylan and Howard (1995) describes the KOWWIN program methodology.

In a "fragment constant" method, a structure is divided into fragments (atom or larger functional groups) and coefficient values of each fragment or group are summed together to yield the solubility estimate. We call WATERNT’s methodology the Atom/Fragment Contribution (AFC) method. Coefficients for individual fragments and groups in WATERNT were derived by multiple regression of 1000 reliably measured water solubility values.

To estimate water solubility, WATERNT initially separates a molecule into distinct atom/fragments. In general, each non-hydrogen atom (e.g. carbon, nitrogen, oxygen, sulfur, etc.) in a structure is a "core" for a fragment; the exact fragment is determined by what is connected to the atom. Several functional groups are treated as core "atoms"; these include carbonyl (C=O), thiocarbonyl (C=S), nitro (-NO2), nitrate (ONO2), cyano (-C≡N), and isothiocyanate (-N=C=S). Connections to each core "atom" are either general or specific; specific connections take precedence over general connections. For example, aromatic carbon, aromatic oxygen and aromatic sulfur atoms have nothing but general connections; i.e., the fragment is the same no matter what is connected to the atom. In contrast, there are 5 aromatic nitrogen fragments: (a) in a five-member ring, (b) in a six-member ring, (c) if the nitrogen is an oxide-type {e.g. pyridine oxide}, (d) if the nitrogen has a fused ring location {e.g. indolizine}, and (e) if the nitrogen has a +5 valence {e.g. N-methyl pyridinium iodide}; since the oxide-type is most specific, it takes precedence over the other four. The aliphatic carbon atom is another example; it does not matter what is connected to -CH3, -CH2-, or -CH< , the fragment is the same; however, an aliphatic carbon with no hydrogens has two possible fragments: (a) if there are four single bonds with 3 or more carbon connections and (b) any other not meeting the first criterion.

It became apparent, for various types of structures, that water solubility estimates made from atom/fragment values alone could or needed to be improved by inclusion of substructures larger or more complex than "atoms"; hence, correction factors were added to the AFC method. The term "correction factor" is appropriate because their values are derived from the differences between the water solubility estimates from atoms alone and the measured water solubility values. The correction factors have two main groupings: first, factors involving aromatic ring substituent positions and second, miscellaneous factors. In general, the correction factors are values for various steric interactions, hydrogen-bondings, and effects from polar functional substructures. Individual correction factors were selected through a tedious process of correlating the differences (between solubility estimates from atom/fragments alone and measured solubility values) with common substructures.

Results of two successive multiple regressions (first for atom/fragments and second for correction factors) yield the following general equation for estimating water solubility of any organic compound:

log WatSol (moles/L) = Σ(fi * ni) + Σ(cj * nj) + 0.24922
(n = 1128, correlation coef (r2) = 0.940, standard deviation = 0.537, avg deviation = 0.355)
where Σ(fi * ni) is the summation of fi (the coefficient for each atom/fragment) times ni (the number of times the atom/fragment occurs in the structure) and Σ(cj * nj) is the summation of cj (the coefficient for each correction factor) times nj (the number of times the correction factor is applied in the molecule).
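
The summation in the general equation can be sketched in Python as follows. Only the intercept (0.24922) comes from the text above; the fragment and correction names and their coefficient values are hypothetical placeholders, not the coefficients listed in the WATERNT appendices.

```python
from typing import Mapping

# Intercept of the general equation quoted above.
INTERCEPT = 0.24922

# Hypothetical coefficients, for illustration only.
DEMO_FRAGMENT_COEFS = {"frag_A": -0.5, "frag_B": 0.25}
DEMO_CORRECTION_COEFS = {"corr_X": 0.1}


def afc_log_watsol(fragment_counts: Mapping[str, int],
                   fragment_coefs: Mapping[str, float],
                   correction_counts: Mapping[str, int],
                   correction_coefs: Mapping[str, float]) -> float:
    """Return log WatSol (moles/L) = sum(fi * ni) + sum(cj * nj) + intercept."""
    fragment_sum = sum(fragment_coefs[name] * n
                       for name, n in fragment_counts.items())
    correction_sum = sum(correction_coefs[name] * n
                         for name, n in correction_counts.items())
    return fragment_sum + correction_sum + INTERCEPT


if __name__ == "__main__":
    # A made-up structure with two frag_A, one frag_B and one corr_X.
    print(afc_log_watsol({"frag_A": 2, "frag_B": 1}, DEMO_FRAGMENT_COEFS,
                         {"corr_X": 1}, DEMO_CORRECTION_COEFS))
```

Splitting a real structure into atom/fragments is the hard part of the method and is not attempted here; the sketch covers only the final summation step.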

Additionally, WATERNT automatically retrieves experimental water solubility values from a database containing more than 6200 organic compounds with measured values. When a SMILES structure matches a database structure (via an exact atom-to-atom connection match), the experimental water solubility value is retrieved and shown in the Results Window.


 - Defined domain of applicability:
(see point 5 below)

- Appropriate measures of goodness-of-fit and robustness and predictivity:
WSKOWWIN
The regression equations used by the WSKOWWIN program were trained with a dataset of 1450 compounds. WSKOWWIN estimates water solubility with one of two possible equations. When an experimental melting point is available, WSKOWWIN applies the equation containing both a melting point and the molecular weight (MW) parameters. In the absence of a melting point, the equation containing just the molecular weight is used to make the estimate. All compounds in the 1450 compound training set have known melting points or are known to be liquids at 25 °C. The accuracy statistics for the two equations are as follows:

                 Melt Pt + MW    MW only
r2               0.970           0.934
std deviation    0.409           0.585
avg deviation    0.313           0.442

Validation

The WSKOWWIN estimation equations were initially validated on two datasets of compounds that were not included in the model training.  A relatively small dataset was tested that consisted of 85 compounds having experimental log Kow values, but no available melting points.  Many compounds in the 85 compound test set decompose before melting and would theoretically have very high melting points (e.g. amino acids and compounds having multiple nitrogens).  The accuracy statistics for the equation used by WSKOWWIN are:

number 85
r2 0.865
std deviation 0.961
avg deviation 0.714
 
A much larger dataset of 817 compounds was also tested.  All 817 compounds had experimental melting points, but none of the 817 compounds had a reliable experimental log Kow.  The log Kow values used for the validation-testing were estimated (primarily using the KOWWIN program available at that time); therefore, the water solubility estimates are based on estimates for log Kow.  Typically, estimates based on estimates reduce estimation accuracy, but this type of validation can provide insight into the ability of the method.  The accuracy statistics for this dataset are:

number 817
r2 0.902
std deviation 0.615
avg deviation 0.480
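
For readers who want to reproduce statistics of this kind on their own data, a minimal Python sketch is given below. The definitions are assumptions, since the record does not spell them out: r2 is taken as the squared Pearson correlation between experimental and estimated log solubilities, "std deviation" as the sample standard deviation of the residuals, and "avg deviation" as the mean absolute residual. The function name is hypothetical.

```python
import math
from typing import Dict, Sequence


def accuracy_stats(experimental: Sequence[float],
                   estimated: Sequence[float]) -> Dict[str, float]:
    """Summarise agreement between experimental and estimated log values."""
    n = len(experimental)
    if n != len(estimated) or n < 2:
        raise ValueError("need two equally long series with n >= 2")
    residuals = [est - exp for exp, est in zip(experimental, estimated)]
    mean_res = sum(residuals) / n
    # Assumed definitions, see the lead-in paragraph.
    avg_dev = sum(abs(r) for r in residuals) / n
    std_dev = math.sqrt(sum((r - mean_res) ** 2 for r in residuals) / (n - 1))
    mean_x = sum(experimental) / n
    mean_y = sum(estimated) / n
    sxy = sum((x - mean_x) * (y - mean_y)
              for x, y in zip(experimental, estimated))
    sxx = sum((x - mean_x) ** 2 for x in experimental)
    syy = sum((y - mean_y) ** 2 for y in estimated)
    r2 = (sxy ** 2) / (sxx * syy)
    return {"n": n, "r2": r2, "std_dev": std_dev, "avg_dev": avg_dev}


if __name__ == "__main__":
    # Made-up log S values with a constant offset of 0.5 log units.
    print(accuracy_stats([-1.0, -2.0, -3.0, -4.0], [-1.5, -2.5, -3.5, -4.5]))
```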
  
Availability of Training & Validation Datasets

The complete datasets used to train and validate the SAR equations used by the WSKOWWIN program are available in two documents prepared for the U.S. Environmental Protection Agency, Office of Pollution Prevention and Toxics (Meylan and Howard, 1994a,b).  These documents, which also detail the estimation methodology, can be downloaded from the Internet at:

http://esc.syrres.com/interkow/EpiSuiteData.htm

Substructure searchable formats of the data can be downloaded at:
http://esc.syrres.com/interkow/EpiSuiteData_ISIS_SDF.htm


WATERNT
Training Set Data
The estimation accuracy for the current WATERNT Program training set is as follows:
n= 1128
r2= 0.940
std dev= 0.537
avg dev= 0.355

Validation Data Sets
Currently, WATERNT has been tested on a validation dataset of 4,636 compounds not included in the training set.  These 4636 compounds were collected from the PHYSPROP Database.  Various compounds having experimental water solubility values in PHYSPROP were excluded from the validation set (these included the majority of compounds that were inorganics, had measurements outside a temperature range of 10 to 40 degrees C, or were measured at specific pH values that might skew an estimation comparison).  The complete training and validation data sets are available as noted below.
 
The validation set includes a diverse selection of chemical structures that rigorously test the predictive accuracy of any model.  It contains many chemicals that are similar in structure to chemicals in the training set, but also many chemicals that are different from and structurally more complex than chemicals in the training set.  Statistical performance for estimated vs experimental log WatSol (moles/L) are as follows:
n= 4636
r2= 0.815
std dev= 1.045
avg dev= 0.796

Availability of Training and Validation Datasets

The complete training and validation data sets can be downloaded from the Internet at:
http://esc.syrres.com/interkow/EpiSuiteData.htm

Substructure searchable formats of the data can be downloaded at:
http://esc.syrres.com/interkow/EpiSuiteData_ISIS_SDF.htm


- Mechanistic interpretation:


5. APPLICABILITY DOMAIN
WSKOWWIN
Appendix E gives the number of compounds in the 1450 compound training set containing each of the correction factors. The WSKOWWIN program applies an individual correction factor only once per structure [if at all] regardless of how many instances of the applicable structural feature occur in the structure. The minimum number of instances is zero and the maximum is one.

 

Range of water solubilities in the Training set:
Minimum  =  4 × 10⁻⁷ mg/L (octachlorodibenzo-p-dioxin)
Maximum =  completely soluble (various)

Range of Molecular Weights in the Training set:
Minimum  =  27.03 (hydrocyanic acid)
Maximum =  627.62 (hexabromobiphenyl)

Range of Log Kow values in the Training set:
Minimum  =  -3.89 (aspartic acid)
Maximum =  8.27 (decachlorobiphenyl)

Currently there is no universally accepted definition of model domain.  However, users may wish to consider the possibility that water solubility estimates are less accurate for compounds outside the MW range, water solubility range and log Kow range of the training set compounds.  It is also possible that a compound may have a functional group(s) or other structural features not represented in the training set, and for which no correction factor was developed.  These points should be taken into consideration when interpreting model results.
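
A simple range check against the WSKOWWIN training set limits quoted above can be sketched as follows. The limits are those listed in this section; the class and function names are hypothetical, and the check covers only MW and log Kow, not the structural-feature considerations also mentioned above.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Range:
    low: float
    high: float

    def contains(self, value: float) -> bool:
        """Return True when value lies within the closed interval."""
        return self.low <= value <= self.high


# Training set ranges as listed in this section.
WSKOWWIN_MW = Range(27.03, 627.62)
WSKOWWIN_LOG_KOW = Range(-3.89, 8.27)


def domain_flags(mw: float, log_kow: float) -> List[str]:
    """List the descriptors that fall outside the training set ranges."""
    flags = []
    if not WSKOWWIN_MW.contains(mw):
        flags.append("MW outside training range")
    if not WSKOWWIN_LOG_KOW.contains(log_kow):
        flags.append("log Kow outside training range")
    return flags


if __name__ == "__main__":
    # Hypothetical inputs; not the values of the registered substance.
    print(domain_flags(mw=100.0, log_kow=2.0))
    print(domain_flags(mw=900.0, log_kow=12.0))
```

An empty list means that both descriptors are inside the training ranges; it does not by itself show that the estimate is reliable.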

WATERNT
Appendix D lists (for each fragment and correction factor) the maximum number of instances of that fragment in any of the 1128 training set compounds (the minimum number of instances is of course zero, since not all compounds had every fragment).  The minimum and maximum values for molecular weight are the following:

Training Set Molecular Weights:
Minimum MW:   30.03 (formaldehyde)
Maximum MW:  627.62 (hexabromobiphenyl)
Average MW:     187.73


Training Set Water Solubility Ranges:
Minimum Solubility (mg/L):   0.0000004  (octachlorodibenzo-p-dioxin)
Minimum Solubility (log moles/L):  -12.0605  (octachlorodibenzo-p-dioxin)
Maximum Solubility (mg/L):  miscible  (various)
Maximum Solubility (log moles/L):  1.3561  (acetaldehyde)
 
Currently there is no universally accepted definition of model domain.  However, users may wish to consider the possibility that water solubility estimates are less accurate for compounds outside the MW range of the training set compounds, and/or that have more instances of a given fragment than the maximum for all training set compounds.  It is also possible that a compound may have a functional group(s) or other structural features not represented in the training set, and for which no fragment coefficient was developed.  These points should be taken into consideration when interpreting model results.
 
6. ADEQUACY OF THE RESULT
The molecular weight of the substance is above the molecular weight range of the training sets of both models. However, the calculated values agree with the expectation of very low solubility, as observed experimentally, and are considered valid for supporting the limit solubility value obtained in the laboratory.
Reason / purpose for cross-reference:
(Q)SAR model reporting (QMRF)
Reason / purpose for cross-reference:
other: QPRF constituent #4
Guideline:
other:
Version / remarks:
REACH Guidance on QSARs R.6
Principles of method if other than guideline:
WSKowWIN:
Meylan, W.M. and P.H. Howard.    1994a.    Upgrade of PCGEMS Water Solubility Estimation Method (May 1994 Draft).  prepared for Robert S. Boethling, U.S. Environmental Protection Agency, Office of Pollution Prevention and Toxics, Washington, DC;  prepared by Syracuse Research Corporation, Environmental Science Center, Syracuse, NY 13210.

WATERNT:
Meylan, W.M. and P.H. Howard, 1995 Atom/fragment contribution method for estimating octanol-water partition coefficients. J. Pharm. Sci. 84: 83-92
GLP compliance:
no
Specific details on test material used for the study:
SMILES : O=C(CCCCCC)OCC(COC(=O)CC(C)CC(C)(C)C)(COC(=O)CC(C)CC(C)(C)C)COCC(COC(=O)CCCCCC)(COC(=O)CC(C)CC(C)(C)C)COC(=O)CCCCCC
Water solubility:
< 0 mg/L
Conc. based on:
test mat.
Remarks on result:
other: QSAR predicted value

WSKOWWIN predicted that the constituent DPE777999 has a water solubility of 1.27E-014 mg/L

WATERNT predicted that the constituent DPE777999 has a water solubility of 1.01E-007 mg/L

Conclusions:
Water solubility of DPE777999 is predicted to be <0.00001 mg/L
Endpoint:
water solubility
Type of information:
(Q)SAR
Adequacy of study:
weight of evidence
Study period:
December 2018
Reliability:
2 (reliable with restrictions)
Justification for type of information:
1. SOFTWARE:
WSKOWWIN and WATERNT programs included in EPI Suite (Estimation Programs Interface Suite™ for Microsoft® Windows, v 4.11)


2. MODEL (incl. version number)

WSKowWin v1.42 estimates the water solubility (WSol) of an organic compound using the compound’s log octanol-water partition coefficient (Kow). WSKOWWIN requires only a chemical structure to estimate Wsol.
The estimation methodology used by WSKOWWIN (Meylan and Howard, 1994a,b) is described in the following document prepared for the U.S. Environmental Protection Agency (OPPT): Upgrade of PCGEMS Water Solubility Estimation Method (May 1994). A companion document (Validation of Water Solubility Estimation Methods Using Log Kow for Application in PCGEMS & EPI) also discusses the methodology. A journal article that describes the methodology is also available (Meylan et al., 1996).

The Water Solubility Program (WATERNT v1.01) estimates the water solubility of organic compounds at 25ºC. WATERNT requires only a chemical structure to estimate a solubility. Structures are entered into WATERNT by SMILES (Simplified Molecular Input Line Entry System) notations.


3. SMILES OR OTHER IDENTIFIERS USED AS INPUT FOR THE MODEL

SMILES : O=C(CCCCCC)OCC(COC(=O)CC(C)CC(C)(C)C)(COC(=O)CC(C)CC(C)(C)C)COCC(COC(=O)CCCCCC)(COC(=O)CC(C)CC(C)(C)C)COC(=O)CCCCCC

4. SCIENTIFIC VALIDITY OF THE (Q)SAR MODEL

- Defined endpoint:
Water solubility of an organic compound, maximum amount of substance that can be dissolved in water (g/L)

- Unambiguous algorithm:
WSKOWWIN estimates the water solubility (WSol) of an organic compound using the compound’s log octanol-water partition coefficient (Kow). WSKOWWIN requires only a chemical structure to estimate Wsol.
The estimation methodology used by WSKOWWIN (Meylan and Howard, 1994a,b) is described in the following document prepared for the U.S. Environmental Protection Agency (OPPT): Upgrade of PCGEMS Water Solubility Estimation Method (May 1994). A companion document (Validation of Water Solubility Estimation Methods Using Log Kow for Application in PCGEMS & EPI) also discusses the methodology. A journal article that describes the methodology is also available (Meylan et al., 1996).

WSKOWWIN uses equations 19 and 20 from these documents because they are the best available equations for estimating Wsol:
Equation 19 is: log S (mol/L) = 0.796 - 0.854 log Kow - 0.00728 MW + Corrections
Equation 20 is: log S (mol/L) = 0.693 - 0.96 log Kow - 0.0092(Tm-25) - 0.00314 MW + Corrections
where MW is molecular weight and Tm is the melting point (MP) in deg C (used only for solids). Corrections are applied to 15 structure types (e.g. alcohols, acids, selected phenols, nitros, amines, alkyl pyridines, amino acids, PAHs, multi-nitrogen types, etc.).

Equation 20 is used when a measured MP is available (as in the present case, EC 945-883-1); otherwise, equation 19 is used. These equations were derived from a dataset consisting of 1450 compounds with measured log Kow, water solubility, and MP. Equation 20 has the following statistical accuracy: correlation coefficient (r2) = 0.97, standard deviation = 0.409 log units, and absolute mean error = 0.313 log units. Application to a validation dataset of 817 compounds gave the following statistical accuracy: correlation coefficient (r2) = 0.902, standard deviation = 0.615 log units, and absolute mean error = 0.480 log units.

WSKOWWIN estimates a log Kow for every SMILES notation by using the estimation engine from the KOWWIN Program (SRC, 2000). WSKOWWIN also automatically retrieves experimental log Kow values from a database containing more than 13200 organic compounds with reliably measured values. When a SMILES structure matches a database structure (via an exact atom-to-atom connection match), the experimental log Kow value is retrieved and used to predict Wsol rather than the estimated value.
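
The preference for an experimental log Kow over an estimated one can be illustrated with a small lookup-with-fallback sketch. This is a simplification under stated assumptions: the real program matches structures by atom-to-atom connectivity, whereas the sketch uses a plain dictionary keyed by a hypothetical structure identifier, and the estimator is a stand-in callable rather than the KOWWIN engine.

```python
from typing import Callable, Mapping, Tuple


def select_log_kow(structure_key: str,
                   experimental_db: Mapping[str, float],
                   estimate: Callable[[str], float]) -> Tuple[float, str]:
    """Return (log Kow, source), preferring a database value when present."""
    if structure_key in experimental_db:
        return experimental_db[structure_key], "experimental"
    return estimate(structure_key), "estimated"


if __name__ == "__main__":
    # Hypothetical database entry and a dummy estimator.
    demo_db = {"structure-A": 1.23}

    def dummy_estimator(key: str) -> float:
        return 4.56

    print(select_log_kow("structure-A", demo_db, dummy_estimator))
    print(select_log_kow("structure-B", demo_db, dummy_estimator))
```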

WSKOWWIN v1.4 includes an experimental water solubility database of 6230 compounds. When experimental data are available for the SMILES being estimated, the data are retrieved and shown in the Results Window.
A complete description of the estimation methodology used by WSKOWWIN is available in two documents prepared for the U.S. Environmental Protection Agency, Office of Pollution Prevention and Toxics (Meylan and Howard, 1994a,b). A journal article that describes the methodology is also available (Meylan et al., 1996). The WSKOWWIN program estimates the water solubility of an organic compound using the compound’s log octanol-water partition coefficient (log Kow). A brief description is given below.

Data Collection
A database of more than 8400 compounds with reliably measured log Kow values had already been compiled from available sources. Most experimental values were taken from a "star-list" compilation of Hansch and Leo (1985) that had already been critically evaluated (see also Hansch et al, 1995) or an extensive compilation by Sangster (1993) that includes many "recommended" values based upon critical evaluation. Other log Kow values were taken from sources located through the Environmental Fate Data Base (EFDB) system (Howard et al, 1982, 1986). A few values were taken from Section 4a, 8d, and 8e submissions to the U.S. EPA under the Toxic Substances Control Act (see http://www.syrres.com/esc/tscats_info.htm).

Water solubilities were collected from the AQUASOL dATAbASE™ of the University of Arizona (Yalkowsky and Dannenfelser, 1990), Syracuse Research Corporation's PHYSPROP© Database (SRC, 1994), and sources located through the Environmental Fate Data Base (EFDB) system (Howard et al, 1982, 1986). Water solubilities were primarily constrained to the 20-25 °C temperature range, with 25 °C being preferred.

Melting points were collected from sources such as AQUASOL dATAbASE™, PHYSPROP©, and EFDB as well as the Handbook of Chemistry and Physics (Lide, 1990) and the Aldrich Catalog (Aldrich, 1992).
Regression & Results

A dataset of 1450 compounds (941 solids, 509 liquids) having reliably measured water solubility, log Kow and melting point was used as the training set for developing the new estimation algorithms for water solubility.  Standard linear regressions were used to fit  water solubility (as log S) with log Kow, melting point and molecular weight.

Residual errors from the initial regression fit were examined for compounds sharing common structural features with relatively consistent errors.  On that basis, 12 compound classes were initially identified and added to the regression to comprise a multi-linear regression including log Kow, melting point and/or molecular weight plus 12 correction factors.  Each correction factor is counted a maximum of once per structure [if applicable], no matter how many times the applicable fragment occurs.  For example, the nitro factor in 1,4-dinitrobenzene is counted just once.  A compound either contains a correction factor or it doesn't; therefore, the matrix for the multi-linear regression contained either a 0 or 1 for each correction factor. Appendix E describes the correction factors and coefficients used by WSKOWWIN.
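
The "counted at most once" rule for correction factors can be shown with a short sketch that builds one 0/1 row of the regression matrix. The factor names used here are hypothetical labels; only the behaviour (an indicator rather than a count) is taken from the paragraph above.

```python
from typing import List, Mapping, Sequence


def correction_row(occurrences: Mapping[str, int],
                   factor_names: Sequence[str]) -> List[int]:
    """Return a 0/1 indicator per correction factor for one structure.

    A factor that occurs several times still contributes a single 1.
    """
    return [1 if occurrences.get(name, 0) > 0 else 0 for name in factor_names]


if __name__ == "__main__":
    # 1,4-dinitrobenzene carries two nitro groups; the nitro factor is
    # nevertheless entered once. "amine" is an unrelated example label.
    print(correction_row({"nitro": 2}, ["nitro", "amine"]))
```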

WSKOWWIN estimates water solubility for any compound with one of two possible equations.  The equations are equations 19 and 20 from Meylan and Howard (1994a) or equations 11 and 12 from the journal article (Meylan et al., 1996).  The equations are:

     log S (mol/L)  =  0.796 - 0.854 log Kow - 0.00728 MW + ΣCorrections

    log S (mol/L)  =  0.693 - 0.96 log Kow - 0.0092(Tm-25) - 0.00314 MW + ΣCorrections

where MW is molecular weight and Tm is the melting point (MP) in deg C (used only for solids). The summed corrections (ΣCorrections) are applied as described in Appendix E. When a measured MP is available, the equation containing Tm is used; otherwise, the equation with just MW is used.


The WATERNT program and estimation methodology were developed at Syracuse Research Corporation for the US Environmental Protection Agency and are described in the document Preliminary Report: Water Solubility Estimation by Base Compound Modification (Sept 1995). The estimation methodology is based upon a "fragment constant" method very similar to the method of the KOWWIN Program, which estimates octanol-water partition coefficients. A journal article by Meylan and Howard (1995) describes the KOWWIN program methodology.

In a "fragment constant" method, a structure is divided into fragments (atom or larger functional groups) and coefficient values of each fragment or group are summed together to yield the solubility estimate. We call WATERNT’s methodology the Atom/Fragment Contribution (AFC) method. Coefficients for individual fragments and groups in WATERNT were derived by multiple regression of 1000 reliably measured water solubility values.

To estimate water solubility, WATERNT initially separates a molecule into distinct atom/fragments. In general, each non-hydrogen atom (e.g. carbon, nitrogen, oxygen, sulfur, etc.) in a structure is a "core" for a fragment; the exact fragment is determined by what is connected to the atom. Several functional groups are treated as core "atoms"; these include carbonyl (C=O), thiocarbonyl (C=S), nitro (-NO2), nitrate (ONO2), cyano (-C≡N), and isothiocyanate (-N=C=S). Connections to each core "atom" are either general or specific; specific connections take precedence over general connections. For example, aromatic carbon, aromatic oxygen and aromatic sulfur atoms have nothing but general connections; i.e., the fragment is the same no matter what is connected to the atom. In contrast, there are 5 aromatic nitrogen fragments: (a) in a five-member ring, (b) in a six-member ring, (c) if the nitrogen is an oxide-type {e.g. pyridine oxide}, (d) if the nitrogen has a fused ring location {e.g. indolizine}, and (e) if the nitrogen has a +5 valence {e.g. N-methyl pyridinium iodide}; since the oxide-type is most specific, it takes precedence over the other four. The aliphatic carbon atom is another example; it does not matter what is connected to -CH3, -CH2-, or -CH< , the fragment is the same; however, an aliphatic carbon with no hydrogens has two possible fragments: (a) if there are four single bonds with 3 or more carbon connections and (b) any other not meeting the first criterion.

It became apparent, for various types of structures, that water solubility estimates made from atom/fragment values alone could or needed to be improved by inclusion of substructures larger or more complex than "atoms"; hence, correction factors were added to the AFC method. The term "correction factor" is appropriate because their values are derived from the differences between the water solubility estimates from atoms alone and the measured water solubility values. The correction factors have two main groupings: first, factors involving aromatic ring substituent positions and second, miscellaneous factors. In general, the correction factors are values for various steric interactions, hydrogen-bondings, and effects from polar functional substructures. Individual correction factors were selected through a tedious process of correlating the differences (between solubility estimates from atom/fragments alone and measured solubility values) with common substructures.

Results of two successive multiple regressions (first for atom/fragments and second for correction factors) yield the following general equation for estimating water solubility of any organic compound:

log WatSol (moles/L) = Σ(fi * ni) + Σ(cj * nj) + 0.24922
(n = 1128, correlation coef (r2) = 0.940, standard deviation = 0.537, avg deviation = 0.355)
where Σ(fi * ni) is the summation of fi (the coefficient for each atom/fragment) times ni (the number of times the atom/fragment occurs in the structure) and Σ(cj * nj) is the summation of cj (the coefficient for each correction factor) times nj (the number of times the correction factor is applied in the molecule).
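
As a minimal sketch of how the general equation above is evaluated, the following Python function sums the fragment and correction-factor contributions and adds the constant 0.24922. The function name, the dictionary layout and all coefficient values in the example are hypothetical and serve only to illustrate the arithmetic; the real WATERNT coefficients are not reproduced here.

```python
from typing import Mapping

INTERCEPT = 0.24922  # equation constant quoted in the text


def afc_log_watsol(
    fragment_counts: Mapping[str, int],
    fragment_coefs: Mapping[str, float],
    correction_counts: Mapping[str, int],
    correction_coefs: Mapping[str, float],
) -> float:
    """Return log WatSol (mol/L) = sum(f_i * n_i) + sum(c_j * n_j) + 0.24922."""
    fragment_sum = sum(
        fragment_coefs[name] * n for name, n in fragment_counts.items()
    )
    correction_sum = sum(
        correction_coefs[name] * n for name, n in correction_counts.items()
    )
    return fragment_sum + correction_sum + INTERCEPT


# Hypothetical fragment names and coefficients (NOT WATERNT values).
example = afc_log_watsol(
    fragment_counts={"frag_a": 2, "frag_b": 1},
    fragment_coefs={"frag_a": -0.5, "frag_b": 0.25},
    correction_counts={"corr_x": 1},
    correction_coefs={"corr_x": 0.1},
)
print(round(example, 5))
```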

Additionally, WATERNT automatically retrieves experimental water solubility values from a database containing more than 6200 organic compounds with measured values. When a SMILES structure matches a database structure (via an exact atom-to-atom connection match), the experimental water solubility value is retrieved and shown in the Results Window.


 - Defined domain of applicability:
(see point 5 below)

- Appropriate measures of goodness-of-fit and robustness and predictivity:
WSKOWWIN
The regression equations used by the WSKOWWIN program were trained with a dataset of 1450 compounds. WSKOWWIN estimates water solubility with one of two possible equations.  When an experimental melting point is available, WSKOWWIN applies the equation containing both a melting point and the molecular weight (MW) parameters.  In the absence of a melting point, the equation containing just the molecular weight is used to make the estimate.  All compounds in the 1450-compound training set have known melting points or are known to be liquids at 25 °C.  The accuracy statistics for the two equations are as follows:

                 Melt Pt + MW    MW only
r2               0.970           0.934
std deviation    0.409           0.585
avg deviation    0.313           0.442

Validation

The WSKOWWIN estimation equations were initially validated on two datasets of compounds that were not included in the model training.  A relatively small dataset was tested that consisted of 85 compounds having experimental log Kow values, but no available melting points.  Many compounds in the 85 compound test set decompose before melting and would theoretically have very high melting points (e.g. amino acids and compounds having multiple nitrogens).  The accuracy statistics for the equation used by WSKOWWIN are:

number 85
r2 0.865
std deviation 0.961
avg deviation 0.714
 
A much larger dataset of 817 compounds was also tested.  All 817 compounds had experimental melting points, but none of the 817 compounds had a reliable experimental log Kow.  The log Kow values used for the validation-testing were estimated (primarily using the KOWWIN program available at that time); therefore, the water solubility estimates are based on estimates for log Kow.  Typically, estimates based on estimates reduce estimation accuracy, but this type of validation can provide insight into the ability of the method.  The accuracy statistics for this dataset are:

number 817
r2 0.902
std deviation 0.615
avg deviation 0.480
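
A short sketch of how two of the statistics reported above could be recomputed from paired estimated and measured log solubility values. It assumes that r2 is the squared Pearson correlation and that "avg deviation" is the mean absolute difference; the exact definition of the reported standard deviation is not given in the text, so it is left out. The function name and the sample values are hypothetical.

```python
from typing import Dict, Sequence


def accuracy_stats(
    estimated: Sequence[float], measured: Sequence[float]
) -> Dict[str, float]:
    """Return n, r2 (squared Pearson correlation) and mean absolute deviation."""
    n = len(estimated)
    if n != len(measured) or n < 2:
        raise ValueError("need two equally long sequences with at least 2 values")
    mean_e = sum(estimated) / n
    mean_m = sum(measured) / n
    cov = sum((e - mean_e) * (m - mean_m) for e, m in zip(estimated, measured))
    var_e = sum((e - mean_e) ** 2 for e in estimated)
    var_m = sum((m - mean_m) ** 2 for m in measured)
    r2 = cov ** 2 / (var_e * var_m)
    avg_dev = sum(abs(e - m) for e, m in zip(estimated, measured)) / n
    return {"n": n, "r2": r2, "avg_dev": avg_dev}


# Hypothetical log S values, for illustration only.
stats = accuracy_stats([1.0, 2.0, 3.0], [1.5, 2.5, 3.5])
print(stats)
```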
  
Availability of Training & Validation Datasets

The complete datasets used to train and validate the SAR equations used by the WSKOWWIN program are available in two documents prepared for the U.S. Environmental Protection Agency, Office of Pollution Prevention and Toxics (Meylan and Howard, 1994a,b).  These documents, which also detail the estimation methodology, can be downloaded from the Internet at:

http://esc.syrres.com/interkow/EpiSuiteData.htm

Substructure searchable formats of the data can be downloaded at:
http://esc.syrres.com/interkow/EpiSuiteData_ISIS_SDF.htm


WATERNT
Training Set Data
The estimation accuracy of the WATERNT program for its current training set is:
n= 1128
r2= 0.940
std dev= 0.537
avg dev= 0.355

Validation Data Sets
Currently, WATERNT has been tested on a validation dataset of 4,636 compounds not included in the training set.  These 4636 compounds were collected from the PHYSPROP Database.  Various compounds having experimental water solubility values in PHYSPROP were excluded from the validation set (these included the majority of compounds that were inorganics, had measurements outside a temperature range of 10 to 40 degrees C, or were measured at specific pH values that might skew an estimation comparison).  The complete training and validation data sets are available as noted below.
 
The validation set includes a diverse selection of chemical structures that rigorously test the predictive accuracy of any model.  It contains many chemicals that are similar in structure to chemicals in the training set, but also many chemicals that are different from and structurally more complex than chemicals in the training set.  Statistical performance for estimated vs experimental log WatSol (moles/L) are as follows:
n= 4636
r2= 0.815
std dev= 1.045
avg dev= 0.796

Availability of Training and Validation Datasets

The complete training and validation data sets can be downloaded from the Internet at:
http://esc.syrres.com/interkow/EpiSuiteData.htm

Substructure searchable formats of the data can be downloaded at:
http://esc.syrres.com/interkow/EpiSuiteData_ISIS_SDF.htm


- Mechanistic interpretation:


5. APPLICABILITY DOMAIN
WSKOWWIN
Appendix E gives the number of compounds in the 1450-compound training set containing each of the correction factors.  The WSKOWWIN program applies an individual correction factor only once per structure [if at all] regardless of how many instances of the applicable structural feature occur in the structure.  The minimum number of instances is zero and the maximum is one.

 

Range of water solubilities in the Training set:
Minimum  =  4 x 10^-7 mg/L (octachlorodibenzo-p-dioxin)
Maximum =  completely soluble (various)

Range of Molecular Weights in the Training set:
Minimum  =  27.03 (hydrocyanic acid)
Maximum =  627.62 (hexabromobiphenyl)

Range of Log Kow values in the Training set:
Minimum  =  -3.89 (aspartic acid)
Maximum =  8.27 (decachlorobiphenyl)

Currently there is no universally accepted definition of model domain.  However, users may wish to consider the possibility that water solubility estimates are less accurate for compounds outside the MW range, water solubility range and log Kow range of the training set compounds.  It is also possible that a compound may have a functional group(s) or other structural features not represented in the training set, and for which no correction factor was developed.  These points should be taken into consideration when interpreting model results.
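
The range comparison suggested above can be sketched as a simple check against the training set limits listed in this section. The function name and the example inputs are hypothetical; only the range limits are taken from the text. A returned flag means "treat the estimate with caution", not that the estimate is invalid.

```python
from typing import List

MW_RANGE = (27.03, 627.62)     # hydrocyanic acid .. hexabromobiphenyl
LOG_KOW_RANGE = (-3.89, 8.27)  # aspartic acid .. decachlorobiphenyl
MIN_SOLUBILITY_MG_L = 4e-7     # octachlorodibenzo-p-dioxin


def wskowwin_domain_flags(
    mw: float, log_kow: float, est_solubility_mg_l: float
) -> List[str]:
    """List the training set ranges that a compound falls outside of."""
    flags = []
    if not MW_RANGE[0] <= mw <= MW_RANGE[1]:
        flags.append("molecular weight outside training range")
    if not LOG_KOW_RANGE[0] <= log_kow <= LOG_KOW_RANGE[1]:
        flags.append("log Kow outside training range")
    if est_solubility_mg_l < MIN_SOLUBILITY_MG_L:
        flags.append("estimated solubility below training minimum")
    return flags


# Hypothetical inputs for a large, very hydrophobic ester.
flags = wskowwin_domain_flags(mw=900.0, log_kow=12.0, est_solubility_mg_l=1e-15)
print(flags)
```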

WATERNT
Appendix D lists (for each fragment and correction factor) the maximum number of instances of that fragment in any of the 1128 training set compounds (the minimum number of instances is of course zero, since not all compounds had every fragment).  The minimum and maximum values for molecular weight are the following:

Training Set Molecular Weights:
Minimum MW:   30.03 (formaldehyde)
Maximum MW:  627.62 (hexabromobiphenyl)
Average MW:     187.73


Training Set Water Solubility Ranges:
Minimum Solubility (mg/L):   0.0000004  (octachlorodibenzo-p-dioxin)
Minimum Solubility (log moles/L):  -12.0605  (octachlorodibenzo-p-dioxin)
Maximum Solubility (mg/L):  miscible  (various)
Maximum Solubility (log moles/L):  1.3561  (acetaldehyde)
 
Currently there is no universally accepted definition of model domain.  However, users may wish to consider the possibility that water solubility estimates are less accurate for compounds outside the MW range of the training set compounds, and/or that have more instances of a given fragment than the maximum for all training set compounds.  It is also possible that a compound may have a functional group(s) or other structural features not represented in the training set, and for which no fragment coefficient was developed.  These points should be taken into consideration when interpreting model results.
 
6. ADEQUACY OF THE RESULT
The molecular weight of the substance is above the range of molecular weights in the training sets of both models. However, the calculated values agree with the expectation of very low solubility, as observed experimentally, and are considered valid for supporting the limit solubility value obtained in the laboratory.
Reason / purpose for cross-reference:
(Q)SAR model reporting (QMRF)
Reason / purpose for cross-reference:
other: QPRF constituent #5
Guideline:
other:
Version / remarks:
REACH Guidance on QSARs R.6
Principles of method if other than guideline:
WSKOWWIN:
Meylan, W.M. and P.H. Howard. 1994a. Upgrade of PCGEMS Water Solubility Estimation Method (May 1994 Draft). Prepared for Robert S. Boethling, U.S. Environmental Protection Agency, Office of Pollution Prevention and Toxics, Washington, DC; prepared by Syracuse Research Corporation, Environmental Science Center, Syracuse, NY 13210.

WATERNT:
Meylan, W.M. and P.H. Howard, 1995 Atom/fragment contribution method for estimating octanol-water partition coefficients. J. Pharm. Sci. 84: 83-92
GLP compliance:
no
Specific details on test material used for the study:
SMILES : O=C(CCCCCC)OCC(COC(=O)CC(C)CC(C)(C)C)(COC(=O)CC(C)CC(C)(C)C)COCC(COC(=O)CCCCCC)(COC(=O)CC(C)CC(C)(C)C)COC(=O)CCCCCC
Water solubility:
< 0 mg/L
Conc. based on:
test mat.
Remarks on result:
other: QSAR predicted value

WSKOWWIN predicted that the constituent DPE779999 has a water solubility of 1.83E-015 mg/L

WATERNT predicted that the constituent DPE779999 has a water solubility of 1.04E-007 mg/L

Conclusions:
Water solubility of DPE779999 is predicted to be <0.00001 mg/L
Endpoint:
water solubility
Type of information:
(Q)SAR
Adequacy of study:
weight of evidence
Study period:
December 2018
Reliability:
2 (reliable with restrictions)
Justification for type of information:
1. SOFTWARE:
Programs WSKOWWIN and WATERNT, included in EPI Suite (Estimation Programs Interface Suite™ for Microsoft® Windows, v 4.11)


2. MODEL (incl. version number)

WSKOWWIN v1.42 estimates the water solubility (WSol) of an organic compound using the compound’s log octanol-water partition coefficient (log Kow). WSKOWWIN requires only a chemical structure to estimate WSol.
The estimation methodology used by WSKOWWIN (Meylan and Howard, 1994a,b) is described in the following document prepared for the U.S. Environmental Protection Agency (OPPT): Upgrade of PCGEMS Water Solubility Estimation Method (May 1994). A companion document (Validation of Water Solubility Estimation Methods Using Log Kow for Application in PCGEMS & EPI) also discusses the methodology. A journal article that describes the methodology is also available (Meylan et al., 1996).

The Water Solubility Program (WATERNT v1.01) estimates the water solubility of organic compounds at 25ºC. WATERNT requires only a chemical structure to estimate a solubility. Structures are entered into WATERNT by SMILES (Simplified Molecular Input Line Entry System) notations.


3. SMILES OR OTHER IDENTIFIERS USED AS INPUT FOR THE MODEL
SMILES : O=C(CC(C)CC(C)(C)C)OCC(COC(=O)CC(C)CC(C)(C)C)(COC(=O)CC(C)CC(C)(C)C)COCC(COC(=O)CC(C)CC(C)(C)C)(COC(=O)CC(C)CC(C)(C)C)COC(=O)CC(C)CC(C)(C)

4. SCIENTIFIC VALIDITY OF THE (Q)SAR MODEL

- Defined endpoint:
Water solubility of an organic compound, maximum amount of substance that can be dissolved in water (g/L)

- Unambiguous algorithm:
WSKOWWIN estimates the water solubility (WSol) of an organic compound using the compound’s log octanol-water partition coefficient (Kow). WSKOWWIN requires only a chemical structure to estimate Wsol.
The estimation methodology used by WSKOWWIN (Meylan and Howard, 1994a,b) is described in the following document prepared for the U.S. Environmental Protection Agency (OPPT): Upgrade of PCGEMS Water Solubility Estimation Method (May 1994). A companion document (Validation of Water Solubility Estimation Methods Using Log Kow for Application in PCGEMS & EPI) also discusses the methodology. A journal article that describes the methodology is also available (Meylan et al., 1996).

WSKOWWIN uses equations 19 and 20 from these documents because they are the best available equations for estimating Wsol:
Equation 19 is: log S (mol/L) = 0.796 - 0.854 log Kow - 0.00728 MW + Corrections
Equation 20 is: log S (mol/L) = 0.693 - 0.96 log Kow - 0.0092(Tm-25) - 0.00314 MW + Corrections
where MW is molecular weight and Tm is melting point (MP) in deg C (used only for solids). Corrections are applied to 15 structure types (e.g. alcohols, acids, selected phenols, nitros, amines, alkyl pyridines, amino acids, PAHs, multi-nitrogen types, etc.).

Equation 20 is used when a measured MP is available (as for the substance under study, EC 945-883-1); otherwise, equation 19 is used. These equations were derived from a dataset consisting of 1450 compounds with measured log Kow, water solubility, and MP. Equation 20 has the following statistical accuracy: correlation coefficient (r2) = 0.97, standard deviation = 0.409 log units, and absolute mean error = 0.313 log units. Application to a validation dataset of 817 compounds gave the following statistical accuracy: correlation coefficient (r2) = 0.902, standard deviation = 0.615 log units, and absolute mean error = 0.480 log units.
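
The choice between equations 19 and 20 can be sketched in Python as follows. The coefficients are those quoted above; the function name and parameter names are hypothetical, the correction term is passed in as a single pre-summed number, and setting the melting point term to zero for Tm at or below 25 deg C is an assumption that reflects the statement that Tm is used only for solids.

```python
from typing import Optional


def wskowwin_log_s(
    log_kow: float,
    mw: float,
    melting_point_c: Optional[float] = None,
    corrections: float = 0.0,
) -> float:
    """Return log S (mol/L) from equation 19 (no MP) or equation 20 (with MP)."""
    if melting_point_c is None:
        # Equation 19: molecular weight only.
        return 0.796 - 0.854 * log_kow - 0.00728 * mw + corrections
    # Equation 20: the melting point term applies to solids only (assumption:
    # it is set to zero for compounds melting at or below 25 deg C).
    mp_term = max(melting_point_c - 25.0, 0.0)
    return 0.693 - 0.96 * log_kow - 0.0092 * mp_term - 0.00314 * mw + corrections


# Hypothetical inputs, for illustration only.
log_s_eq19 = wskowwin_log_s(log_kow=2.0, mw=100.0)
log_s_eq20 = wskowwin_log_s(log_kow=2.0, mw=100.0, melting_point_c=125.0)
print(round(log_s_eq19, 3), round(log_s_eq20, 3))
```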

WSKOWWIN estimates a log Kow for every SMILES notation by using the estimation engine from the KOWWIN Program (SRC, 2000). WSKOWWIN also automatically retrieves experimental log Kow values from a database containing more than 13200 organic compounds with reliably measured values. When a SMILES structure matches a database structure (via an exact atom-to-atom connection match), the experimental log Kow value is retrieved and used to predict Wsol rather than the estimated value.

WSKOWWIN v1.4 includes an experimental water solubility database of 6230 compounds. When experimental data are available for the SMILES being estimated, the data are retrieved and shown in the Results Window.
A complete description of the estimation methodology used by WSKOWWIN is available in two documents prepared for the U.S. Environmental Protection Agency, Office of Pollution Prevention and Toxics (Meylan and Howard, 1994a,b). A journal article that describes the methodology is also available (Meylan et al., 1996).  The WSKOWWIN program estimates the water solubility of an organic compound using the compound's log octanol-water partition coefficient (log Kow). A brief description is given below.

Data Collection
A database of more than 8400 compounds with reliably measured log Kow values had already been compiled from available sources.  Most experimental values were taken from a "star-list" compilation of Hansch and Leo (1985) that had already been critically evaluated (see also Hansch et al., 1995) or an extensive compilation by Sangster (1993) that includes many "recommended" values based upon critical evaluation.  Other log Kow values were taken from sources located through the Environmental Fate Data Base (EFDB) system (Howard et al., 1982, 1986).  A few values were taken from Section 4a, 8d, and 8e submissions to the U.S. EPA under the Toxic Substances Control Act (see http://www.syrres.com/esc/tscats_info.htm).

Water solubilities were collected from the AQUASOL dATAbASE™ of the University of Arizona (Yalkowsky and Dannenfelser, 1990), Syracuse Research Corporation's PHYSPROP© Database (SRC, 1994), and sources located through the Environmental Fate Data Base (EFDB) system (Howard et al., 1982, 1986).  Water solubilities were primarily constrained to the 20-25 °C temperature range, with 25 °C being preferred.

Melting points were collected from sources such as AQUASOL dATAbASE™, PHYSPROP©, and EFDB as well as the Handbook of Chemistry and Physics (Lide, 1990) and the Aldrich Catalog (Aldrich, 1992).
Regression & Results

A dataset of 1450 compounds (941 solids, 509 liquids) having reliably measured water solubility, log Kow and melting point was used as the training set for developing the new estimation algorithms for water solubility.  Standard linear regressions were used to fit  water solubility (as log S) with log Kow, melting point and molecular weight.

Residual errors from the initial regression fit were examined for compounds sharing common structural features with relatively consistent errors.  On that basis, 12 compound classes were initially identified and added to the regression to comprise a multi-linear regression including log Kow, melting point and/or molecular weight plus 12 correction factors.  Each correction factor is counted a maximum of once per structure [if applicable], no matter how many times the applicable fragment occurs.  For example, the nitro factor in 1,4-dinitrobenzene is counted just once.  A compound either contains a correction factor or it doesn't; therefore, the matrix for the multi-linear regression contained either a 0 or 1 for each correction factor. Appendix E describes the correction factors and coefficients used by WSKOWWIN.
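
The once-per-structure rule described above amounts to turning feature counts into 0/1 indicators. A minimal sketch follows; the function name and the factor labels "nitro" and "alcohol" are hypothetical stand-ins for the correction factors listed in Appendix E.

```python
from typing import Dict, Iterable, Mapping


def correction_indicators(
    feature_counts: Mapping[str, int], factor_names: Iterable[str]
) -> Dict[str, int]:
    """Map each correction factor to 1 if the feature occurs at all, else 0."""
    return {
        name: 1 if feature_counts.get(name, 0) > 0 else 0
        for name in factor_names
    }


# 1,4-dinitrobenzene has two nitro groups, but the nitro factor counts once.
row = correction_indicators({"nitro": 2}, ["nitro", "alcohol"])
print(row)
```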

WSKOWWIN estimates water solubility for any compound with one of two possible equations.  The equations are equations 19 and 20 from Meylan and Howard (1994a) or equations 11 and 12 from the journal article (Meylan et al., 1996).  The equations are:

     log S (mol/L)  =  0.796 - 0.854 log Kow - 0.00728 MW + ΣCorrections

    log S (mol/L)  =  0.693 - 0.96 log Kow - 0.0092(Tm-25) - 0.00314 MW + ΣCorrections

(where MW is molecular weight, Tm is melting point (MP) in deg C [used only for solids]) ... Summation of Corrections (ΣCorrections) are applied as described in Appendix E.   When a measured MP is available, that equation is used; otherwise, the equation with just MW is used.
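
Both equations return log S in mol/L, whereas the predicted values in this dossier are reported in mg/L. The conversion between the two is sketched below; the function name and the example values are hypothetical.

```python
def log_s_to_mg_per_l(log_s_mol_per_l: float, mw_g_per_mol: float) -> float:
    """Convert log10 solubility in mol/L to solubility in mg/L."""
    mol_per_l = 10.0 ** log_s_mol_per_l
    # mol/L * g/mol = g/L; multiply by 1000 to obtain mg/L.
    return mol_per_l * mw_g_per_mol * 1000.0


# Hypothetical example: log S = -3 for a compound with MW = 200 g/mol.
solubility = log_s_to_mg_per_l(-3.0, 200.0)
print(solubility)
```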


The WATERNT program and estimation methodology were developed at Syracuse Research Corporation for the US Environmental Protection Agency and are described in the document Preliminary Report: Water Solubility Estimation by Base Compound Modification (Sept 1995). The estimation methodology is based upon a "fragment constant" method very similar to the method of the KOWWIN Program, which estimates octanol-water partition coefficients. A journal article by Meylan and Howard (1995) describes the KOWWIN program methodology.

In a "fragment constant" method, a structure is divided into fragments (atom or larger functional groups) and coefficient values of each fragment or group are summed together to yield the solubility estimate. We call WATERNT’s methodology the Atom/Fragment Contribution (AFC) method. Coefficients for individual fragments and groups in WATERNT were derived by multiple regression of 1000 reliably measured water solubility values.

To estimate water solubility, WATERNT initially separates a molecule into distinct atom/fragments. In general, each non-hydrogen atom (e.g. carbon, nitrogen, oxygen, sulfur) in a structure is a "core" for a fragment; the exact fragment is determined by what is connected to the atom. Several functional groups are treated as core "atoms"; these include carbonyl (C=O), thiocarbonyl (C=S), nitro (-NO2), nitrate (-ONO2), cyano (-C≡N), and isothiocyanate (-N=C=S). Connections to each core "atom" are either general or specific; specific connections take precedence over general connections. For example, aromatic carbon, aromatic oxygen and aromatic sulfur atoms have nothing but general connections; i.e., the fragment is the same no matter what is connected to the atom. In contrast, there are 5 aromatic nitrogen fragments: (a) in a five-member ring, (b) in a six-member ring, (c) if the nitrogen is an oxide-type {e.g. pyridine oxide}, (d) if the nitrogen has a fused ring location {e.g. indolizine}, and (e) if the nitrogen has a +5 valence {e.g. N-methyl pyridinium iodide}; since the oxide-type is most specific, it takes precedence over the other four. The aliphatic carbon atom is another example; it does not matter what is connected to -CH3, -CH2-, or -CH<; the fragment is the same. However, an aliphatic carbon with no hydrogens has two possible fragments: (a) if there are four single bonds with 3 or more carbon connections and (b) any other not meeting the first criterion.

It became apparent, for various types of structures, that water solubility estimates made from atom/fragment values alone could or needed to be improved by inclusion of substructures larger or more complex than "atoms"; hence, correction factors were added to the AFC method. The term "correction factor" is appropriate because their values are derived from the differences between the water solubility estimates from atoms alone and the measured water solubility values. The correction factors have two main groupings: first, factors involving aromatic ring substituent positions and second, miscellaneous factors. In general, the correction factors are values for various steric interactions, hydrogen-bondings, and effects from polar functional substructures. Individual correction factors were selected through a tedious process of correlating the differences (between solubility estimates from atom/fragments alone and measured solubility values) with common substructures.

Results of two successive multiple regressions (first for atom/fragments and second for correction factors) yield the following general equation for estimating water solubility of any organic compound:

log WatSol (moles/L) = Σ(fi * ni) + Σ(cj * nj) + 0.24922
(n = 1128, correlation coef (r2) = 0.940, standard deviation = 0.537, avg deviation = 0.355)
where Σ(fi * ni) is the summation of fi (the coefficient for each atom/fragment) times ni (the number of times the atom/fragment occurs in the structure) and Σ(cj * nj) is the summation of cj (the coefficient for each correction factor) times nj (the number of times the correction factor is applied in the molecule).

Additionally, WATERNT automatically retrieves experimental water solubility values from a database containing more than 6200 organic compounds with measured values. When a SMILES structure matches a database structure (via an exact atom-to-atom connection match), the experimental water solubility value is retrieved and shown in the Results Window.


 - Defined domain of applicability:
(see point 5 below)

- Appropriate measures of goodness-of-fit and robustness and predictivity:
WSKOWWIN
The regression equations used by the WSKOWWIN program were trained with a dataset of 1450 compounds. WSKOWWIN estimates water solubility with one of two possible equations.  When an experimental melting point is available, WSKOWWIN applies the equation containing both a melting point and the molecular weight (MW) parameters.  In the absence of a melting point, the equation containing just the molecular weight is used to make the estimate.  All compounds in the 1450-compound training set have known melting points or are known to be liquids at 25 °C.  The accuracy statistics for the two equations are as follows:

                 Melt Pt + MW    MW only
r2               0.970           0.934
std deviation    0.409           0.585
avg deviation    0.313           0.442

Validation

The WSKOWWIN estimation equations were initially validated on two datasets of compounds that were not included in the model training.  A relatively small dataset was tested that consisted of 85 compounds having experimental log Kow values, but no available melting points.  Many compounds in the 85 compound test set decompose before melting and would theoretically have very high melting points (e.g. amino acids and compounds having multiple nitrogens).  The accuracy statistics for the equation used by WSKOWWIN are:

number 85
r2 0.865
std deviation 0.961
avg deviation 0.714
 
A much larger dataset of 817 compounds was also tested.  All 817 compounds had experimental melting points, but none of the 817 compounds had a reliable experimental log Kow.  The log Kow values used for the validation-testing were estimated (primarily using the KOWWIN program available at that time); therefore, the water solubility estimates are based on estimates for log Kow.  Typically, estimates based on estimates reduce estimation accuracy, but this type of validation can provide insight into the ability of the method.  The accuracy statistics for this dataset are:

number 817
r2 0.902
std deviation 0.615
avg deviation 0.480
  
Availability of Training & Validation Datasets

The complete datasets used to train and validate the SAR equations used by the WSKOWWIN program are available in two documents prepared for the U.S. Environmental Protection Agency, Office of Pollution Prevention and Toxics (Meylan and Howard, 1994a,b).  These documents, which also detail the estimation methodology, can be downloaded from the Internet at:

http://esc.syrres.com/interkow/EpiSuiteData.htm

Substructure searchable formats of the data can be downloaded at:
http://esc.syrres.com/interkow/EpiSuiteData_ISIS_SDF.htm


WATERNT
Training Set Data
The estimation accuracy of the WATERNT program for its current training set is:
n= 1128
r2= 0.940
std dev= 0.537
avg dev= 0.355

Validation Data Sets
Currently, WATERNT has been tested on a validation dataset of 4,636 compounds not included in the training set.  These 4636 compounds were collected from the PHYSPROP Database.  Various compounds having experimental water solubility values in PHYSPROP were excluded from the validation set (these included the majority of compounds that were inorganics, had measurements outside a temperature range of 10 to 40 degrees C, or were measured at specific pH values that might skew an estimation comparison).  The complete training and validation data sets are available as noted below.
 
The validation set includes a diverse selection of chemical structures that rigorously test the predictive accuracy of any model.  It contains many chemicals that are similar in structure to chemicals in the training set, but also many chemicals that are different from and structurally more complex than chemicals in the training set.  Statistical performance for estimated vs experimental log WatSol (moles/L) are as follows:
n= 4636
r2= 0.815
std dev= 1.045
avg dev= 0.796

Availability of Training and Validation Datasets

The complete training and validation data sets can be downloaded from the Internet at:
http://esc.syrres.com/interkow/EpiSuiteData.htm

Substructure searchable formats of the data can be downloaded at:
http://esc.syrres.com/interkow/EpiSuiteData_ISIS_SDF.htm


- Mechanistic interpretation:


5. APPLICABILITY DOMAIN
WSKOWWIN
Appendix E gives the number of compounds in the 1450-compound training set containing each of the correction factors.  The WSKOWWIN program applies an individual correction factor only once per structure [if at all] regardless of how many instances of the applicable structural feature occur in the structure.  The minimum number of instances is zero and the maximum is one.

 

Range of water solubilities in the Training set:
Minimum  =  4 x 10^-7 mg/L (octachlorodibenzo-p-dioxin)
Maximum =  completely soluble (various)

Range of Molecular Weights in the Training set:
Minimum  =  27.03 (hydrocyanic acid)
Maximum =  627.62 (hexabromobiphenyl)

Range of Log Kow values in the Training set:
Minimum  =  -3.89 (aspartic acid)
Maximum =  8.27 (decachlorobiphenyl)

Currently there is no universally accepted definition of model domain.  However, users may wish to consider the possibility that water solubility estimates are less accurate for compounds outside the MW range, water solubility range and log Kow range of the training set compounds.  It is also possible that a compound may have a functional group(s) or other structural features not represented in the training set, and for which no correction factor was developed.  These points should be taken into consideration when interpreting model results.

WATERNT
Appendix D lists (for each fragment and correction factor) the maximum number of instances of that fragment in any of the 1128 training set compounds (the minimum number of instances is of course zero, since not all compounds had every fragment).  The minimum and maximum values for molecular weight are the following:

Training Set Molecular Weights:
Minimum MW:   30.03 (formaldehyde)
Maximum MW:  627.62 (hexabromobiphenyl)
Average MW:     187.73


Training Set Water Solubility Ranges:
Minimum Solubility (mg/L):   0.0000004  (octachlorodibenzo-p-dioxin)
Minimum Solubility (log moles/L):  -12.0605  (octachlorodibenzo-p-dioxin)
Maximum Solubility (mg/L):  miscible  (various)
Maximum Solubility (log moles/L):  1.3561  (acetaldehyde)
 
Currently there is no universally accepted definition of model domain.  However, users may wish to consider the possibility that water solubility estimates are less accurate for compounds outside the MW range of the training set compounds, and/or that have more instances of a given fragment than the maximum for all training set compounds.  It is also possible that a compound may have a functional group(s) or other structural features not represented in the training set, and for which no fragment coefficient was developed.  These points should be taken into consideration when interpreting model results.
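
The two WATERNT checks mentioned above (molecular weight range and fragment counts versus the training set maxima of Appendix D) can be sketched as follows. The function name, the fragment labels and the maximum counts in the example are hypothetical; Appendix D is not reproduced here.

```python
from typing import List, Mapping

MW_MIN = 30.03   # formaldehyde (molar mass 30.03 g/mol)
MW_MAX = 627.62  # hexabromobiphenyl


def waternt_domain_flags(
    mw: float,
    fragment_counts: Mapping[str, int],
    max_training_counts: Mapping[str, int],
) -> List[str]:
    """List reasons why a WATERNT estimate may be less accurate."""
    flags = []
    if not MW_MIN <= mw <= MW_MAX:
        flags.append("molecular weight outside training range")
    for name, count in fragment_counts.items():
        if name not in max_training_counts:
            flags.append(f"fragment not in training set: {name}")
        elif count > max_training_counts[name]:
            flags.append(f"more instances than any training compound: {name}")
    return flags


# Hypothetical fragment labels and training maxima, for illustration only.
flags = waternt_domain_flags(
    mw=900.0,
    fragment_counts={"ester": 6, "unknown_group": 1},
    max_training_counts={"ester": 4},
)
print(flags)
```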
 
6. ADEQUACY OF THE RESULT
The molecular weight of the substance is above the range of molecular weights in the training sets of both models. However, the calculated values agree with the expectation of very low solubility, as observed experimentally, and are considered valid for supporting the limit solubility value obtained in the laboratory.
Reason / purpose for cross-reference:
(Q)SAR model reporting (QMRF)
Reason / purpose for cross-reference:
other: QPRF constituent #6
Guideline:
other:
Version / remarks:
REACH Guidance on QSARs R.6
Principles of method if other than guideline:
WSKOWWIN:
Meylan, W.M. and P.H. Howard. 1994a. Upgrade of PCGEMS Water Solubility Estimation Method (May 1994 Draft). Prepared for Robert S. Boethling, U.S. Environmental Protection Agency, Office of Pollution Prevention and Toxics, Washington, DC; prepared by Syracuse Research Corporation, Environmental Science Center, Syracuse, NY 13210.

WATERNT:
Meylan, W.M. and P.H. Howard, 1995 Atom/fragment contribution method for estimating octanol-water partition coefficients. J. Pharm. Sci. 84: 83-92
GLP compliance:
no
Specific details on test material used for the study:
SMILES : O=C(CC(C)CC(C)(C)C)OCC(COC(=O)CC(C)CC(C)(C)C)(COC(=O)CC(C)CC(C)(C)C)COCC(COC(=O)CC(C)CC(C)(C)C)(COC(=O)CC(C)CC(C)(C)C)COC(=O)CC(C)CC(C)(C)
Water solubility:
< 0 mg/L
Conc. based on:
test mat.
Remarks on result:
other: QSAR predicted value

WSKOWWIN predicted that the constituent DPE799999 has a water solubility of 2.63E-016 mg/L

WATERNT predicted that the constituent DPE799999 has a water solubility of 1.07E-006 mg/L

Conclusions:
Water solubility of DPE799999 is predicted to be <0.00001 mg/L
Endpoint:
water solubility
Type of information:
(Q)SAR
Adequacy of study:
weight of evidence
Study period:
December 2018
Reliability:
2 (reliable with restrictions)
Justification for type of information:
1. SOFTWARE:
Programs WSKOWWIN and WATERNT, included in EPI Suite (Estimation Programs Interface Suite™ for Microsoft® Windows, v 4.11)


2. MODEL (incl. version number)

WSKOWWIN v1.42 estimates the water solubility (WSol) of an organic compound using the compound’s log octanol-water partition coefficient (log Kow). WSKOWWIN requires only a chemical structure to estimate WSol.
The estimation methodology used by WSKOWWIN (Meylan and Howard, 1994a,b) is described in the following document prepared for the U.S. Environmental Protection Agency (OPPT): Upgrade of PCGEMS Water Solubility Estimation Method (May 1994). A companion document (Validation of Water Solubility Estimation Methods Using Log Kow for Application in PCGEMS & EPI) also discusses the methodology. A journal article that describes the methodology is also available (Meylan et al., 1996).

The Water Solubility Program (WATERNT v1.01) estimates the water solubility of organic compounds at 25ºC. WATERNT requires only a chemical structure to estimate a solubility. Structures are entered into WATERNT by SMILES (Simplified Molecular Input Line Entry System) notations.


3. SMILES OR OTHER IDENTIFIERS USED AS INPUT FOR THE MODEL

SMILES : O=C(CC(C)CC(C)(C)C)OCC(COC(=O)CC(C)CC(C)(C)C)(COC(=O)CC(C)CC(C)(C)C)COCC(COC(=O)CC(C)CC(C)(C)C)(COC(=O)CC(C)CC(C)(C)C)COC(=O)CC(C)CC(C)(C)(C)

4. SCIENTIFIC VALIDITY OF THE (Q)SAR MODEL

- Defined endpoint:
Water solubility of an organic compound, maximum amount of substance that can be dissolved in water (g/L)

- Unambiguous algorithm:
WSKOWWIN estimates the water solubility (WSol) of an organic compound using the compound’s log octanol-water partition coefficient (Kow). WSKOWWIN requires only a chemical structure to estimate WSol.
The estimation methodology used by WSKOWWIN (Meylan and Howard, 1994a,b) is described in the following document prepared for the U.S. Environmental Protection Agency (OPPT): Upgrade of PCGEMS Water Solubility Estimation Method (May 1994). A companion document (Validation of Water Solubility Estimation Methods Using Log Kow for Application in PCGEMS & EPI) also discusses the methodology. A journal article that describes the methodology is also available (Meylan et al., 1996).

WSKOWWIN uses equations 19 and 20 from these documents because they are the best available equations for estimating WSol:
Equation 19 is: log S (mol/L) = 0.796 - 0.854 log Kow - 0.00728 MW + Corrections
Equation 20 is: log S (mol/L) = 0.693 - 0.96 log Kow - 0.0092(Tm-25) - 0.00314 MW + Corrections
where MW is molecular weight, Tm is melting point (MP) in deg C (used only for solids). Corrections are applied to 15 structure types (eg. alcohols, acids, selected phenols, nitros, amines, alkyl pyridines, amino acids, PAHS, multi-nitrogen types, etc).

Equation 20 is used when a measured MP is available (like in the case of study, EC 945-883-1); otherwise, equation 19 is used. These equations were derived from a dataset consisting of 1450 compounds with measured log Kow, water sol, and MP. Eq 20 has the following statistical accuracy: correlation coefficient (r2) = 0.97, standard deviation = 0.409 log units, and absolute mean error = 0.313 log units. Application to a validation dataset of 817 compounds gave the following statistical accuracy: correlation coefficient (r2) = 0.902, standard deviation = 0.615 log units, and absolute mean error = 0.480 log units.
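
As a rough illustration of how the two regression equations quoted above could be evaluated, the following Python sketch implements equations 19 and 20 as written. The function name, argument names and example inputs are hypothetical. The correction terms are passed in as a pre-summed value because their coefficients are not listed here. This is not the WSKOWWIN source code.

```python
from typing import Optional


def log_s_wskowwin(log_kow: float, mw: float,
                   tm_c: Optional[float] = None,
                   corrections: float = 0.0) -> float:
    """Return log S (mol/L) from equation 19 or 20 as quoted above.

    tm_c is the measured melting point in deg C; pass it only for solids
    (assumption: callers leave it as None for liquids or when no measured
    value exists). corrections is the already-summed correction term.
    """
    if tm_c is None:
        # Equation 19: molecular weight only
        return 0.796 - 0.854 * log_kow - 0.00728 * mw + corrections
    # Equation 20: melting point and molecular weight
    return (0.693 - 0.96 * log_kow - 0.0092 * (tm_c - 25.0)
            - 0.00314 * mw + corrections)


# Hypothetical inputs, for illustration only
print(round(log_s_wskowwin(2.0, 100.0), 3))               # equation 19
print(round(log_s_wskowwin(2.0, 100.0, tm_c=125.0), 3))   # equation 20
```

The result is in log10 units of mol/L, so a further conversion using the molecular weight would be needed to compare it with values reported in mg/L.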

WSKOWWIN estimates a log Kow for every SMILES notation by using the estimation engine from the KOWWIN Program (SRC, 2000). WSKOWWIN also automatically retrieves experimental log Kow values from a database containing more than 13200 organic compounds with reliably measured values. When a SMILES structure matches a database structure (via an exact atom-to-atom connection match), the experimental log Kow value is retrieved and used to predict WSol rather than the estimated value.

WSKOWWIN v1.4 includes an experimental water solubility database of 6230 compounds. When experimental data are available for the SMILES being estimated, the data are retrieved and shown in the Results Window.
A complete description of the estimation methodology used by WSKOWWIN is available in two documents prepared for the U.S. Environmental Protection Agency, Office of Pollution Prevention and Toxics (Meylan and Howard, 1994a,b). A journal article that describes the methodology is also available (Meylan et al., 1996). The WSKOWWIN program estimates the water solubility of an organic compound using the compound’s log octanol-water partition coefficient (log Kow). A brief description is given below.

Data Collection
A database of more than 8400 compounds with reliably measured log Kow values had already been compiled from available sources. Most experimental values were taken from a "star-list" compilation of Hansch and Leo (1985) that had already been critically evaluated (see also Hansch et al, 1995) or an extensive compilation by Sangster (1993) that includes many "recommended" values based upon critical evaluation. Other log Kow values were taken from sources located through the Environmental Fate Data Base (EFDB) system (Howard et al, 1982, 1986). A few values were taken from Section 4a, 8d, and 8e submissions to the U.S. EPA under the Toxic Substances Control Act (see http://www.syrres.com/esc/tscats_info.htm).

Water solubilities were collected from the AQUASOL dATAbASE™ of the University of Arizona (Yalkowsky and Dannenfelser, 1990), Syracuse Research Corporation's PHYSPROP© Database (SRC, 1994), and sources located through the Environmental Fate Data Base (EFDB) system (Howard et al, 1982, 1986). Water solubilities were primarily constrained to the 20-25 ºC temperature range, with 25 ºC being preferred.

Melting points were collected from sources such as AQUASOL dATAbASE™, PHYSPROP©, and EFDB as well as the Handbook of Chemistry and Physics (Lide, 1990) and the Aldrich Catalog (Aldrich, 1992).

Regression & Results

A dataset of 1450 compounds (941 solids, 509 liquids) having reliably measured water solubility, log Kow and melting point was used as the training set for developing the new estimation algorithms for water solubility.  Standard linear regressions were used to fit  water solubility (as log S) with log Kow, melting point and molecular weight.

Residual errors from the initial regression fit were examined for compounds sharing common structural features with relatively consistent errors.  On that basis, 12 compound classes were initially identified and added to the regression to comprise a multi-linear regression including log Kow, melting point and/or molecular weight plus 12 correction factors.  Each correction factor is counted a maximum of once per structure [if applicable], no matter how many times the applicable fragment occurs.  For example, the nitro factor in 1,4-dinitrobenzene is counted just once.  A compound either contains a correction factor or it doesn't; therefore, the matrix for the multi-linear regression contained either a 0 or 1 for each correction factor. Appendix E describes the correction factors and coefficients used by WSKOWWIN.
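
The "counted at most once" rule described above can be pictured as building a row of 0/1 indicators for each structure. The Python sketch below is only an illustration of that idea. The function name, the factor list and the fragment counts are hypothetical and are not taken from Appendix E.

```python
from typing import Dict, List


def correction_indicators(fragment_counts: Dict[str, int],
                          factor_names: List[str]) -> List[int]:
    """Map per-structure fragment counts to 0/1 regression indicators."""
    return [1 if fragment_counts.get(name, 0) > 0 else 0
            for name in factor_names]


# Hypothetical factor list; 1,4-dinitrobenzene carries two nitro groups,
# but the nitro factor still enters the regression row as a single 1.
factors = ["nitro", "alcohol", "amine"]
row = correction_indicators({"nitro": 2}, factors)
print(row)  # [1, 0, 0]
```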

WSKOWWIN estimates water solubility for any compound with one of two possible equations.  The equations are equations 19 and 20 from Meylan and Howard (1994a) or equations 11 and 12 from the journal article (Meylan et al., 1996).  The equations are:

    log S (mol/L) = 0.796 - 0.854 log Kow - 0.00728 MW + ΣCorrections

    log S (mol/L) = 0.693 - 0.96 log Kow - 0.0092(Tm-25) - 0.00314 MW + ΣCorrections

(where MW is molecular weight, Tm is melting point (MP) in deg C [used only for solids]) ... The summation of corrections (ΣCorrections) is applied as described in Appendix E. When a measured MP is available, the equation containing the Tm term is used; otherwise, the equation with just MW is used.


The WATERNT program and estimation methodology were developed at Syracuse Research Corporation for the US Environmental Protection Agency and are described in the document Preliminary Report: Water Solubility Estimation by Base Compound Modification (Sept 1995). The estimation methodology is based upon a "fragment constant" method very similar to the method of the KOWWIN Program which estimates octanol-water partition coefficients. A journal article by Meylan and Howard (1995) describes the KOWWIN program methodology.

In a "fragment constant" method, a structure is divided into fragments (atom or larger functional groups) and coefficient values of each fragment or group are summed together to yield the solubility estimate. We call WATERNT’s methodology the Atom/Fragment Contribution (AFC) method. Coefficients for individual fragments and groups in WATERNT were derived by multiple regression of 1000 reliably measured water solubility values.

To estimate water solubility, WATERNT initially separates a molecule into distinct atom/fragments. In general, each non-hydrogen atom (e.g. carbon, nitrogen, oxygen, sulfur, etc.) in a structure is a "core" for a fragment; the exact fragment is determined by what is connected to the atom. Several functional groups are treated as core "atoms"; these include carbonyl (C=O), thiocarbonyl (C=S), nitro (-NO2), nitrate (-ONO2), cyano (-C≡N), and isothiocyanate (-N=C=S). Connections to each core "atom" are either general or specific; specific connections take precedence over general connections. For example, aromatic carbon, aromatic oxygen and aromatic sulfur atoms have nothing but general connections; i.e., the fragment is the same no matter what is connected to the atom. In contrast, there are 5 aromatic nitrogen fragments: (a) in a five-member ring, (b) in a six-member ring, (c) if the nitrogen is an oxide-type {i.e. pyridine oxide}, (d) if the nitrogen has a fused ring location {i.e. indolizine}, and (e) if the nitrogen has a +5 valence {i.e. N-methyl pyridinium iodide}; since the oxide-type is most specific, it takes precedence over the other four. The aliphatic carbon atom is another example; it does not matter what is connected to -CH3, -CH2-, or -CH<, the fragment is the same; however, an aliphatic carbon with no hydrogens has two possible fragments: (a) if there are four single bonds with 3 or more carbon connections and (b) any other not meeting the first criterion.

It became apparent, for various types of structures, that water solubility estimates made from atom/fragment values alone could or needed to be improved by inclusion of substructures larger or more complex than "atoms"; hence, correction factors were added to the AFC method. The term "correction factor" is appropriate because their values are derived from the differences between the water solubility estimates from atoms alone and the measured water solubility values. The correction factors have two main groupings: first, factors involving aromatic ring substituent positions and second, miscellaneous factors. In general, the correction factors are values for various steric interactions, hydrogen-bondings, and effects from polar functional substructures. Individual correction factors were selected through a tedious process of correlating the differences (between solubility estimates from atom/fragments alone and measured solubility values) with common substructures.

Results of two successive multiple regressions (first for atom/fragments and second for correction factors) yield the following general equation for estimating water solubility of any organic compound:

log WatSol (moles/L) = Σ(fi * ni) + Σ(cj * nj) + 0.24922
(n = 1128, correlation coef (r2) = 0.940, standard deviation = 0.537, avg deviation = 0.355)
where Σ(fi * ni) is the summation of fi (the coefficient for each atom/fragment) times ni (the number of times the atom/fragment occurs in the structure) and Σ(cj * nj) is the summation of cj (the coefficient for each correction factor) times nj (the number of times the correction factor is applied in the molecule).
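
A minimal Python sketch of the summation in the general equation above is given here for orientation. The fragment names and coefficient values are placeholders invented for the example and are not WATERNT's actual coefficients. Only the intercept 0.24922 comes from the equation as quoted.

```python
from typing import Dict

INTERCEPT = 0.24922  # constant term of the equation quoted above


def log_watsol(fragment_counts: Dict[str, int],
               fragment_coefs: Dict[str, float],
               correction_counts: Dict[str, int],
               correction_coefs: Dict[str, float]) -> float:
    """Return log WatSol (moles/L) = sum(fi*ni) + sum(cj*nj) + 0.24922."""
    fragments = sum(fragment_coefs[name] * n
                    for name, n in fragment_counts.items())
    corrections = sum(correction_coefs[name] * n
                      for name, n in correction_counts.items())
    return fragments + corrections + INTERCEPT


# Placeholder coefficients, NOT the values used by WATERNT
coefs = {"CH3": -0.5, "CH2": -0.25}
corr = {"example_factor": 0.1}
estimate = log_watsol({"CH3": 2, "CH2": 4}, coefs,
                      {"example_factor": 1}, corr)
print(round(estimate, 5))
```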

Additionally, WATERNT automatically retrieves experimental water solubility values from a database containing more than 6200 organic compounds with measured values. When a SMILES structure matches a database structure (via an exact atom-to-atom connection match), the experimental water solubility value is retrieved and shown in the Results Window.


- Defined domain of applicability:
(see point 5 below)

- Appropriate measures of goodness-of-fit and robustness and predictivity:
WSKOWWIN
The regression equations used by the WSKOWWIN program were trained with a dataset of 1450 compounds. WSKOWWIN estimates water solubility with one of two possible equations. When an experimental melting point is available, WSKOWWIN applies the equation containing both the melting point and the molecular weight (MW) parameters. In the absence of a melting point, the equation containing just the molecular weight is used to make the estimate. All compounds in the 1450 compound training set have known melting points or are known to be liquids at 25 ºC. The accuracy statistics for the two equations are as follows:

               Melt Pt + MW   MW only
r2             0.970          0.934
std deviation  0.409          0.585
avg deviation  0.313          0.442

Validation

The WSKOWWIN estimation equations were initially validated on two datasets of compounds that were not included in the model training.  A relatively small dataset was tested that consisted of 85 compounds having experimental log Kow values, but no available melting points.  Many compounds in the 85 compound test set decompose before melting and would theoretically have very high melting points (e.g. amino acids and compounds having multiple nitrogens).  The accuracy statistics for the equation used by WSKOWWIN are:

number 85
r2 0.865
std deviation 0.961
avg deviation 0.714
 
A much larger dataset of 817 compounds was also tested.  All 817 compounds had experimental melting points, but none of the 817 compounds had a reliable experimental log Kow.  The log Kow values used for the validation-testing were estimated (primarily using the KOWWIN program available at that time); therefore, the water solubility estimates are based on estimates for log Kow.  Typically, estimates based on estimates reduce estimation accuracy, but this type of validation can provide insight into the ability of the method.  The accuracy statistics for this dataset are:

number 817
r2 0.902
std deviation 0.615
avg deviation 0.480
  
Availability of Training & Validation Datasets

The complete datasets used to train and validate the SAR equations used by the WSKOWWIN program are available in two documents prepared for the U.S. Environmental Protection Agency, Office of Pollution Prevention and Toxics (Meylan and Howard, 1994a,b).  These documents, which also detail the estimation methodology, can be downloaded from the Internet at:

http://esc.syrres.com/interkow/EpiSuiteData.htm

Substructure searchable formats of the data can be downloaded at:
http://esc.syrres.com/interkow/EpiSuiteData_ISIS_SDF.htm


WATERNT
Training Set Data
The estimation accuracy of the current WATERNT Program training set is summarized by the following statistics:
n= 1128
r2= 0.940
std dev= 0.537
avg dev= 0.355

Validation Data Sets
Currently, WATERNT has been tested on a validation dataset of 4,636 compounds not included in the training set.  These 4636 compounds were collected from the PHYSPROP Database.  Various compounds having experimental water solubility values in PHYSPROP were excluded from the validation set (these included the majority of compounds that were inorganics, had measurements outside a temperature range of 10 to 40 degrees C, or were measured at specific pH values that might skew an estimation comparison).  The complete training and validation data sets are available as noted below.
 
The validation set includes a diverse selection of chemical structures that rigorously test the predictive accuracy of any model.  It contains many chemicals that are similar in structure to chemicals in the training set, but also many chemicals that are different from and structurally more complex than chemicals in the training set.  Statistical performance for estimated vs experimental log WatSol (moles/L) are as follows:
n= 4636
r2= 0.815
std dev= 1.045
avg dev= 0.796

Availability of Training and Validation Datasets

The complete training and validation data sets can be downloaded from the Internet at:
http://esc.syrres.com/interkow/EpiSuiteData.htm

Substructure searchable formats of the data can be downloaded at:
http://esc.syrres.com/interkow/EpiSuiteData_ISIS_SDF.htm


- Mechanistic interpretation:


5. APPLICABILITY DOMAIN
WSKOWWIN
Appendix E gives the number of compounds in the 1450 compound training set containing each of the correction factors. The WSKOWWIN program applies an individual correction factor only once per structure [if at all] regardless of how many instances of the applicable structural feature occur in the structure. The minimum number of instances is zero and the maximum is one.

 

Range of water solubilities in the Training set:
Minimum  =  4 x 10^-7 mg/L (octachlorodibenzo-p-dioxin)
Maximum =  completely soluble (various)

Range of Molecular Weights in the Training set:
Minimum  =  27.03 (hydrocyanic acid)
Maximum =  627.62 (hexabromobiphenyl)

Range of Log Kow values in the Training set:
Minimum  =  -3.89 (aspartic acid)
Maximum =  8.27 (decachlorobiphenyl)

Currently there is no universally accepted definition of model domain.  However, users may wish to consider the possibility that water solubility estimates are less accurate for compounds outside the MW range, water solubility range and log Kow range of the training set compounds.  It is also possible that a compound may have a functional group(s) or other structural features not represented in the training set, and for which no correction factor was developed.  These points should be taken into consideration when interpreting model results.
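
The considerations above can be turned into a simple screening check. The Python sketch below compares a compound's descriptors with the WSKOWWIN training-set ranges listed in this section. The function name and the example inputs are hypothetical, and a flag only indicates that an estimate may be less accurate, not that it is invalid.

```python
from typing import List

# Training-set ranges as listed above for WSKOWWIN
MW_RANGE = (27.03, 627.62)
LOG_KOW_RANGE = (-3.89, 8.27)
MIN_SOLUBILITY_MG_L = 4e-7


def domain_flags(mw: float, log_kow: float,
                 predicted_mg_l: float) -> List[str]:
    """List the descriptors that fall outside the training-set ranges."""
    flags = []
    if not MW_RANGE[0] <= mw <= MW_RANGE[1]:
        flags.append("molecular weight")
    if not LOG_KOW_RANGE[0] <= log_kow <= LOG_KOW_RANGE[1]:
        flags.append("log Kow")
    if predicted_mg_l < MIN_SOLUBILITY_MG_L:
        flags.append("water solubility")
    return flags


# Hypothetical inputs, for illustration only
print(domain_flags(mw=150.0, log_kow=2.5, predicted_mg_l=10.0))
print(domain_flags(mw=900.0, log_kow=12.0, predicted_mg_l=1e-12))
```

Structural features missing from the training set are not covered by a range check of this kind and would still need expert review.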

WATERNT
Appendix D lists (for each fragment and correction factor) the maximum number of instances of that fragment in any of the 1128 training set compounds (the minimum number of instances is of course zero, since not all compounds had every fragment).  The minimum and maximum values for molecular weight are the following:

Training Set Molecular Weights:
Minimum MW:   30.03 (formaldehyde)
Maximum MW:  627.62 (hexabromobiphenyl)
Average MW:     187.73


Training Set Water Solubility Ranges:
Minimum Solubility (mg/L):   0.0000004  (octachlorodibenzo-p-dioxin)
Minimum Solubility (log moles/L):  -12.0605  (octachlorodibenzo-p-dioxin)
Maximum Solubility (mg/L):  miscible  (various)
Maximum Solubility (log moles/L):  1.3561  (acetaldehyde)
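
The mg/L and log moles/L figures above are linked through the molecular weight. The short Python sketch below shows the conversion. The function name is hypothetical, and the molecular weight of octachlorodibenzo-p-dioxin (about 459.75 g/mol for C12Cl8O2) is an assumption supplied for the example rather than a value taken from this dossier.

```python
import math


def mg_per_l_to_log_mol_per_l(solubility_mg_l: float, mw: float) -> float:
    """Convert a solubility in mg/L to log10 of moles/L."""
    return math.log10(solubility_mg_l / 1000.0 / mw)


# Assumed molecular weight of octachlorodibenzo-p-dioxin, g/mol
OCDD_MW = 459.75
print(round(mg_per_l_to_log_mol_per_l(0.0000004, OCDD_MW), 4))
```

With these inputs the result should land close to the minimum of -12.0605 log moles/L listed above.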
 
Currently there is no universally accepted definition of model domain.  However, users may wish to consider the possibility that water solubility estimates are less accurate for compounds outside the MW range of the training set compounds, and/or that have more instances of a given fragment than the maximum for all training set compounds.  It is also possible that a compound may have a functional group(s) or other structural features not represented in the training set, and for which no fragment coefficient was developed.  These points should be taken into consideration when interpreting model results.
 
6. ADEQUACY OF THE RESULT
The molecular weight of the substance is above the range of molecular weights in the training sets of both models. However, the calculated values agree with the expectation of very low solubility, as observed experimentally, and are considered valid for supporting the limit solubility value obtained in the laboratory.
Reason / purpose for cross-reference:
(Q)SAR model reporting (QMRF)
Reason / purpose for cross-reference:
other: QPRF constituent #7
Guideline:
other:
Version / remarks:
REACH Guidance on QSARs R.6
Principles of method if other than guideline:
WSKOWWIN:
Meylan, W.M. and P.H. Howard. 1994a. Upgrade of PCGEMS Water Solubility Estimation Method (May 1994 Draft). Prepared for Robert S. Boethling, U.S. Environmental Protection Agency, Office of Pollution Prevention and Toxics, Washington, DC; prepared by Syracuse Research Corporation, Environmental Science Center, Syracuse, NY 13210.

WATERNT:
Meylan, W.M. and P.H. Howard, 1995 Atom/fragment contribution method for estimating octanol-water partition coefficients. J. Pharm. Sci. 84: 83-92
GLP compliance:
no
Specific details on test material used for the study:
SMILES : O=C(CC(C)CC(C)(C)C)OCC(COC(=O)CC(C)CC(C)(C)C)(COC(=O)CC(C)CC(C)(C)C)COCC(COC(=O)CC(C)CC(C)(C)C)(COC(=O)CC(C)CC(C)(C)C)COC(=O)CC(C)CC(C)(C)(C)
Water solubility:
< 0 mg/L
Conc. based on:
test mat.
Remarks on result:
other: QSAR predicted value

WSKOWWIN predicted that the constituent DPE999999 has a water solubility of 3.78E-017 mg/L

WATERNT predicted that the constituent DPE999999 has a water solubility of 1.09E-006 mg/L

Conclusions:
Water solubility of DPE999999 is predicted to be <0.00001 mg/L

Description of key information

Water solubility at 20 ºC is < 0.1 mg/L.

Key value for chemical safety assessment

Water solubility:
0.1 mg/L
at the temperature of:
20 °C

Additional information