Registration Dossier

Data platform availability banner - registered substances factsheets

Please be aware that this old REACH registration data factsheet is no longer maintained; it remains frozen as of 19th May 2023.

ECHA has released the new ECHA CHEM database, which now contains all REACH registration data. ECHA provides further details on the transition of its published data to ECHA CHEM.

Physical & Chemical properties

Partition coefficient

Administrative data

Endpoint:
partition coefficient
Type of information:
(Q)SAR
Adequacy of study:
key study
Study period:
2017-03-16
Reliability:
2 (reliable with restrictions)
Rationale for reliability incl. deficiencies:
results derived from a valid (Q)SAR model and falling into its applicability domain, with adequate and reliable documentation / justification
Justification for type of information:
1. SOFTWARE
ACD/Percepta 14.0.0 (Build 2726. 27 Nov 2014)

2. MODEL (incl. version number)
ACD/Percepta 14.0.0 (Build 2726. 27 Nov 2014)

3. SMILES OR OTHER IDENTIFIERS USED AS INPUT FOR THE MODEL
Smiles: CC(C)(C)OC(=O)N1CC(O)CC1C(O)=O
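
As a quick sanity check on the input identifier, the heavy atoms in the SMILES above can be counted and compared with the molecular formula C10H17NO5 reported for the test material. The following is a minimal Python sketch and is not part of the ACD/Percepta workflow. The helper name heavy_atom_counts is hypothetical, and the regular expression only handles unbracketed, non-aromatic organic-subset atoms, which is enough for this particular string.

```python
import re
from collections import Counter


def heavy_atom_counts(smiles):
    """Count heavy atoms in a simple SMILES string.

    Assumption: only unbracketed, upper-case organic-subset symbols occur
    (no aromatic lower-case atoms, no [..] atoms). Two-letter symbols are
    matched first so that 'Cl' is not read as 'C' followed by 'l'.
    """
    return Counter(re.findall(r"Cl|Br|[BCNOPSFI]", smiles))


smiles = "CC(C)(C)OC(=O)N1CC(O)CC1C(O)=O"
print(dict(heavy_atom_counts(smiles)))  # {'C': 10, 'O': 5, 'N': 1}
```

The counts agree with the heavy atoms of C10H17NO5. Hydrogens are implicit in the SMILES and are not counted here. The SMILES as entered also carries no stereochemical descriptors, unlike the (2S,4R) name of the reference substance.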

4. SCIENTIFIC VALIDITY OF THE (Q)SAR MODEL
- Defined endpoint:
log Kow (log P): the logarithm of the ratio of the concentrations of the un-ionized compound in n-octanol and in water (log Ko/w).
The dataset used to develop the model was compiled from many different sources covering a wide variety of experimental protocols for determining log Ko/w. These include the classical potentiometric methods involving phase titrations as well as more recent chromatographic methods, such as HPLC on standard and modified resins (immobilized artificial membrane (IAM) and liposome chromatography), capillary electrophoresis, and centrifugal partition chromatography. Since log Ko/w accounts only for the partitioning of neutral species, when a method involves a single data point measurement (i.e. log Ko/w is not determined by extrapolation from a pH dependence curve), the water phase is usually buffered to a pH at which the analyzed compound is predominantly neutral. For a comprehensive overview of experimental log Ko/w measurement techniques, see [1].
log Ko/w is a relatively easily measured property. As a result, the quality of the experimental data, which is usually inversely related to the complexity of the experiment, is reasonably good. Independent external studies show that the difference between log Ko/w measurements performed by different laboratories using the same protocol (reproducibility) can be expected to be within 0.5 logarithmic units [2].

Experimental data from various sources have been used. The characteristics of the entire dataset compiled for the development of this model are:
No. of compounds = 16277
Min. Value = -5.08
Max. Value = 11.29
Std. Dev. = 1.92
Skewness = 0.22

- Unambiguous algorithm:
Global linear baseline QSAR + local similarity-based corrections.
The global QSAR was developed using PLS in combination with a bootstrapping technique. This method implies random compound sampling from the initial training set, i.e. generation of new “training sub-sets”.

Each of the sampled sub-sets is of the same size as the initial training set; however, because compounds are drawn at random with replacement, some compounds are selected more than once while others are omitted. This procedure is performed 100 times and an independent PLS model is derived for every sub-set.
Each of those PLS models is based on 2D fragmental descriptors:

log Ko/w = SUM[i=1..n](ai*fi)+ c

where fi is the number of occurrences of the i-th fragment in a molecule, ai is its statistical coefficient, and c is the intercept.
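
The fragment-additive form of the equation above can be illustrated with a short Python sketch. The function name, the fragment names, the coefficients, and the intercept below are all hypothetical placeholders chosen for illustration; they are not the fitted values of the ACD/Percepta model, whose intercept is not given in this dossier.

```python
def predict_log_kow(fragment_counts, coefficients, intercept):
    """Evaluate log Ko/w = SUM(ai * fi) + c for one molecule.

    fragment_counts: {fragment name: number of occurrences fi}
    coefficients:    {fragment name: statistical coefficient ai}
    A fragment missing from `coefficients` raises KeyError, mirroring the
    requirement that every molecule is described by one fixed fragment set.
    """
    contribution = sum(
        coefficients[name] * count for name, count in fragment_counts.items()
    )
    return contribution + intercept


# Hypothetical two-fragment example (illustrative numbers only).
coefficients = {"frag_A": 0.5, "frag_B": -0.4}
counts = {"frag_A": 2, "frag_B": 1}
print(predict_log_kow(counts, coefficients, intercept=0.1))
```

In the ensemble described in this section, the same evaluation would be repeated once per PLS model, each with its own coefficient set.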

As a result, each global QSAR model actually represents an ensemble of 100 PLS models, providing each compound with a vector of 100 log Ko/w predictions, each based on a slightly different sub-set of the initial training set. Two compounds whose 100-value prediction vectors from a global QSAR model show similar patterns of variation are defined as similar in terms of the analyzed property, i.e. the differences in the compound sets used to parameterize each of the 100 PLS models constituting the baseline model affect the estimates for the two compounds in a similar way. The correlation coefficient of the two vectors is called the Individual Similarity Index between the two compounds (SIi). An analogous definition of “property-specific” or dynamic similarity was first used by Tetko and co-workers [3-7], and the method has more recently been used in the analysis of acute toxicity data [8].
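
The resampling step and the Individual Similarity Index described above can be sketched in Python as follows. This is an illustration under stated assumptions, not the vendor's implementation: the PLS fitting itself is omitted, the function names are hypothetical, and the correlation coefficient is assumed to be the Pearson coefficient, which the text does not state explicitly.

```python
import math
import random


def bootstrap_indices(n_compounds, n_models=100, seed=0):
    """Draw n_models resamples of the training set, each of the original size.

    Sampling is with replacement, so within one resample some compounds
    appear several times and others not at all.
    """
    rng = random.Random(seed)
    return [
        [rng.randrange(n_compounds) for _ in range(n_compounds)]
        for _ in range(n_models)
    ]


def individual_similarity_index(pred_a, pred_b):
    """Pearson correlation between two compounds' ensemble prediction vectors."""
    if len(pred_a) != len(pred_b) or len(pred_a) < 2:
        raise ValueError("prediction vectors must have equal length >= 2")
    n = len(pred_a)
    mean_a = sum(pred_a) / n
    mean_b = sum(pred_b) / n
    cov = sum((a - mean_a) * (b - mean_b) for a, b in zip(pred_a, pred_b))
    var_a = sum((a - mean_a) ** 2 for a in pred_a)
    var_b = sum((b - mean_b) ** 2 for b in pred_b)
    if var_a == 0 or var_b == 0:
        raise ValueError("correlation is undefined for a constant vector")
    return cov / math.sqrt(var_a * var_b)
```

Each index list returned by bootstrap_indices would select the rows used to fit one model of the ensemble. The vectors passed to individual_similarity_index would then hold the 100 predictions that the ensemble produces for each of the two compounds.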

With the available robust similarity measure, it becomes possible to analyse the performance of the baseline QSAR model in the local chemical environment of a query molecule represented by the most similar compounds in the training set. In case any systematic errors are encountered for sufficiently similar compounds, a local correction (Δ) is calculated.
Later on, the model can be trained quickly and efficiently with new experimental data simply by adding them to this second, similarity correction step, without time-consuming re-training of the baseline model.
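
A minimal Python sketch of such a local correction is given here. It assumes that Δ is the plain average of the differences (experimental minus baseline prediction) over the k most similar training compounds, and that the corrected estimate is the baseline prediction plus Δ. Both the function name and these details are assumptions for illustration; the exact weighting used by the software is not specified in this dossier.

```python
def local_correction(neighbours, k=5):
    """Average baseline error over the k most similar training compounds.

    neighbours: list of (similarity, experimental, baseline_predicted) tuples.
    Returns delta, the mean of (experimental - baseline_predicted).
    """
    if not neighbours:
        raise ValueError("at least one neighbour is required")
    # Keep only the k compounds with the highest similarity to the query.
    top = sorted(neighbours, key=lambda item: item[0], reverse=True)[:k]
    residuals = [experimental - predicted for _, experimental, predicted in top]
    return sum(residuals) / len(residuals)


# Hypothetical neighbours: the five most similar compounds are all
# under-predicted by 0.5 log units; the sixth is too dissimilar to count.
neighbours = [
    (0.95, 1.5, 1.0),
    (0.90, 2.5, 2.0),
    (0.85, 3.0, 2.5),
    (0.80, 0.5, 0.0),
    (0.75, 4.5, 4.0),
    (0.10, 9.0, 1.0),
]
delta = local_correction(neighbours)
corrected = 1.2 + delta  # 1.2 is a hypothetical baseline prediction
```

Because the correction only needs the stored experimental values and baseline predictions of the neighbours, adding new measurements to the neighbour pool updates the result without refitting the baseline model.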
Descriptors in the model:
Fragmental descriptors, dimensionless (occurrence counts).
A fixed set of fragmental descriptors is used, based on an expanded list of Platts-type fragments (see [9]). A fixed and relatively small set of fragments was used because of the specifics of the modeling methodology. For the correlation between two compounds' vectors of log Ko/w predictions from the baseline QSAR model to represent compound similarity in terms of the analyzed property, these vectors have to be parameterized using exactly the same set of fragmental descriptors. This rules out automated fragmentation routines (atom based, isolating carbon based, chain based, etc.), which produce a dynamic set of fragments that depends on the training set structures: for a query structure from outside the training set, the same rules could yield new fragments not encountered in the training set molecules, which is incompatible with the condition just mentioned. On the other hand, it is equally important for the model to be able to identify any new structural features of a query molecule that were not present in the training set compounds. The fixed fragment set therefore cannot be constructed from an analysis of the training set, or of any other set of molecules, because any structural features absent from that set would then be ignored. As a result, the fragmental descriptor set is based on general knowledge and considerations regarding all possible chemical structures rather than on a finite dataset, and it includes all such fragments, even those not detected in any training set molecule.

Descriptor selection:
The last fact mentioned in Section 4.3 also rules out the usual descriptor selection techniques, which rely on generating a large initial pool of descriptors and then reducing it during the statistical analysis (exclusion of statistically insignificant or intercorrelated variables, etc.). Such an analysis would by definition have to be based on a certain dataset and would not allow “blank” fragments in the final variable set.

Algorithm and descriptor generation:
The descriptor matrix was generated, following the outlined approach, by counting the occurrences of each of the pre-defined fragments in the training set molecules. This procedure and all subsequent statistical analyses were performed using Algorithm Builder 1.8 software.

Software name and version for descriptor generation:
Algorithm Builder 1.8
ACD/Labs, Inc. 110 Yonge Street, 14th floor, Toronto, Ontario, Canada M5C 1T4.
http://www.acdlabs.com

Chemicals/Descriptors ratio:
30.2 (11387 chemicals in the training set, 377 descriptors)

- Defined domain of applicability:
The applicability domain of the model is defined based on the training set compounds. The procedure takes into account the following two aspects:
* Similarity of the tested compound to the training set. No reliable prediction can be made if the training set contains no similar compounds;
* Consistency of the experimental values with the baseline model for similar compounds. Even if the dataset does contain similar compounds, the quality of the prediction may be lower if their data cannot be reproduced by the baseline model. Whatever the reason for this inconsistency (experimental variability, or a sudden change in mechanism of action caused by slight structural changes), it indicates possible problems in giving accurate predictions.

Method used to assess the applicability domain:
The two aspects mentioned above are assessed quantitatively in terms of the Similarity Index (SI) and the Data-Model Consistency Index (DMCI). The SI, which evaluates how distant the query structure is from the whole training set, is calculated as a weighted average of the individual Similarity Indices (SIi) between the test molecule and each of the 5 most similar compounds in the training set. The DMCI is calculated by comparing the differences between the experimental and global QSAR predicted values for the 5 most similar compounds with the suggested similarity correction value (Δ) for the test compound, which is the average of these differences. The more the individual differences are scattered around the calculated average (Δ), the more inconsistent the data for the similar compounds are with the global QSAR model.
The final prediction Reliability Index is calculated as a product of the aforementioned two indices:
RI = SI * DMCI
Both SI and DMCI are scaled to vary from 0 to 1, so the resulting RI also varies in this range. Lower values suggest a compound being further from the Model Applicability Domain and the prediction less reliable (low SI or low DMCI either alone or in combination can be the reason). On the other hand, high RI values indicate an increasing confidence about the quality of the prediction (both SI and DMCI have to be high to yield such a result).

Limits of applicability:
Predictions with a Reliability Index < 0.3 are considered outside the applicability domain.
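
The Reliability Index and the 0.3 limit can be expressed as a short Python sketch. The function names are hypothetical, and the SI and DMCI values in the example are invented for illustration, since the dossier reports only the final RI of 0.36 for this substance and not its two components.

```python
def reliability_index(si, dmci):
    """RI = SI * DMCI, with both inputs scaled to the range 0..1."""
    for name, value in (("SI", si), ("DMCI", dmci)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must lie between 0 and 1, got {value}")
    return si * dmci


def within_domain(ri, limit=0.3):
    """Return True when the prediction is inside the applicability domain.

    The boundary case RI == limit is not specified in the source; this
    sketch treats it as outside the domain.
    """
    return ri > limit


ri = reliability_index(0.6, 0.6)   # hypothetical SI and DMCI values
print(within_domain(0.36))         # RI reported for this substance: True
```

Because RI is a product, a low value of either index is enough to push a prediction below the limit, whereas a high RI requires both indices to be high.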

- Appropriate measures of goodness-of-fit and robustness and predictivity:
The statistics of the training set data:
No. of compounds = 11387
Min. Value = -5.08
Max. Value = 11.29
Std. Dev. = 1.94
Skewness = 0.25

Statistics for the fraction of the training set that falls within the applicability domain of the model (RI > 0.3 - see Section 5.4):
N (RI > 0.3) = 11371 (i.e. 99.9% of the training set compounds)
R2 = 0.944
Std. Dev. = 0.457
RMSE = 0.457
F = 402696.2 (Fisher's F-statistics)

The statistics of the validation set data:
No. of compounds = 4890
Min. Value = -4.64
Max. Value = 10.89
Std. Dev. = 1.90
Skewness = 0.16

Random splitting of the initial dataset into the training and validation sets using the ratio 70%:30%.

Statistics for the fraction of the validation set that falls within the applicability domain of the model (RI > 0.3 - see Section 5.4):
N (RI > 0.3) = 4872 (i.e. 99.6% of all the validation set compounds)
R2 = 0.940
Std. Dev. = 0.464
RMSE = 0.464
F = 165247.5 (Fisher's F-statistics)

Analysis of the subsets of the higher quality results:
N (RI > 0.5) = 4772 (i.e. 97.6% of all the validation set compounds)
R2 = 0.945
Std. Dev. = 0.444
RMSE = 0.444
F = 177716.6 (Fisher's F-statistics)

N (RI > 0.75) = 3345 (i.e. 68.4% of all the validation set compounds)
R2 = 0.964
Std. Dev. = 0.360
RMSE = 0.360
F = 197041.9 (Fisher's F-statistics)
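
The compound counts and percentages quoted in this section can be cross-checked with a few lines of Python. This sketch only recomputes figures already stated above (set sizes, in-domain counts, and the number of descriptors); the helper name percent is hypothetical.

```python
def percent(part, whole):
    """Share of `part` in `whole`, in percent, rounded to one decimal."""
    return round(100.0 * part / whole, 1)


training, validation, descriptors = 11387, 4890, 377
total = training + validation

checks = {
    "total compounds": total,                                      # 16277
    "training share (%)": percent(training, total),                # 70.0
    "chemicals/descriptors": round(training / descriptors, 1),     # 30.2
    "training, RI > 0.3 (%)": percent(11371, training),            # 99.9
    "validation, RI > 0.3 (%)": percent(4872, validation),         # 99.6
    "validation, RI > 0.5 (%)": percent(4772, validation),         # 97.6
    "validation, RI > 0.75 (%)": percent(3345, validation),        # 68.4
}
for label, value in checks.items():
    print(f"{label}: {value}")
```

The recomputed values match the reported dataset size of 16277 compounds, the 70%:30% split, the chemicals/descriptors ratio of 30.2, and the in-domain percentages.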

- Mechanistic interpretation:
Mechanistic basis of the model:
The only mechanistic consideration used in model building is the use of a linear regression method (PLS) and fragmental descriptors. In other words, it is assumed that the final predicted value is a linear combination of the contributions of all the structural moieties making up the test molecule. Although very basic, this consideration is one of the most fundamental: the very name of (Q)SAR methods implies that the main determinant of a compound's properties is its structure, and fragments are the most direct descriptors of a chemical structure.

A priori or a posteriori mechanistic interpretation:
A posteriori model interpretation results are consistent with generally understood mechanistic factors, scientific interpretations, and well documented experimental facts. For example, the top ten fragmental descriptors with negative coefficients are the following:
Any positive permanent charge = -2.436
Quaternary ammonium = -1.612
Permanent charge on aromatic N, O, S, Se = -1.317
Sulfonic acid = -1.125
alpha-Amino acid = -0.965
N-oxide = -0.674
tertiary amine (>N-) = -0.673
=S< = -0.670
Any phosphorus atom = -0.573
Lactone = -0.404
Some of these fragments are well known for increasing the hydrophilicity of a compound. One more classical example of such a water-phase-favoring group, the hydroxy fragment, follows this top ten almost immediately with a statistical coefficient of -0.400.
Among the groups with the largest positive coefficients, the great majority can clearly be expected to increase the hydrophobicity of a compound, e.g.:
Bicyclo [3.1.1] scaffold = 1.103
Spiro [5.2] scaffold = 1.066
Any Si atom = 0.714
Spiro [6.6] = 0.678
Spiro [6.5] = 0.644
Fused 6:5:5 scaffold = 0.614
Steric hindrance in the form of two bulky branched aliphatic substituents in both ortho- positions of a phenolic group = 0.460
n-Pentyl chain = 0.452
n-Heptyl chain = 0.442
Aromatic sulphur = 0.419
Note: the average of all 377 statistical coefficients is 0.018
All the fragments encoding strong H-bonding in the aromatic system (e.g., ortho-keto, ortho-thioketo, ortho-nitro, or ortho-halogenated phenols and anilines; 6 descriptors in total) have positive coefficients, which agrees with the known fact that such H-bonding reduces hydrophilicity.
The coefficients of these 6 fragments range from +0.005 to +0.455, with an average of +0.15.
Further similar examples can be identified as well.

[1] Avdeef, A., Absorption and Drug Development: Solubility, Permeability, and Charge State, John Wiley & Sons, Inc., Hoboken, NJ, 2003.
[2] Kishi, H. and Hashimoto, Y., Evaluation of the procedures for the measurement of water solubility and n-octanol/water partition coefficient of chemicals: results of a ring test in Japan, Chemosphere, 1989, 18, 1749-1759.
[3] I.V. Tetko, Neural network studies. 4. Introduction to associative neural networks, J. Chem. Inf. Comput. Sci. 2002, 42, 717-728.
[4] I.V. Tetko and P. Bruneau, Application of ALOGPS to predict 1-octanol/water distribution coefficients, logP, and logD, of AstraZeneca inhouse database, J. Pharm. Sci. 2004, 93, 3103-3110.
[5] I.V. Tetko and V.Y. Tanchuk, Application of associative neural networks for prediction of lipophilicity in ALOGPS 2.1 program, J. Chem. Inf. Comput. Sci. 2002, 42, 1136-1145.
[6] H. Zhu, A. Tropsha, D. Fourches, A. Varnek, E. Papa, P. Gramatica, T. Oberg, P. Dao, A. Cherkasov, and I.V. Tetko, Combinatorial QSAR modeling of chemical toxicants tested against Tetrahymena pyriformis, J. Chem. Inf. Model. 2008, 48, 766-784.
[7] I.V. Tetko, I. Sushko, A.K. Pandey, H. Zhu, A. Tropsha, E. Papa, T. Oberg, R. Todeschini, D. Fourches, and A. Varnek, Critical assessment of QSAR models of environmental toxicity against Tetrahymena pyriformis: focusing on applicability domain and overfitting by variable selection, J. Chem. Inf. Model. 2008, 48, 1733-1746.
[8] Sazonovas, A., Japertas, P., and Didziapetris, R., Estimation of reliability of predictions and model applicability domain evaluation in the analysis of acute toxicity (LD50), SAR QSAR Environ. Res. 2010, 21, 127-148.
[9] J.A. Platts, D. Butina, M.H. Abraham, and A. Hersey, Estimation of molecular linear free energy relation descriptors using a group contribution approach, J. Chem. Inf. Comput. Sci. 1999, 39, 835-845.


5. APPLICABILITY DOMAIN
The Reliability Index for the prediction is above 0.3 (RI = 0.36), indicating that the substance is within the applicability domain.

6. ADEQUACY OF THE RESULT
The substance fits in the applicability domain of the model. The prediction is valid and can be used for classification and risk assessment.

Data source

Reference
Reference Type:
study report
Title:
Unnamed
Year:
2017
Report date:
2017

Materials and methods

Principles of method if other than guideline:
- Justification of QSAR prediction: see field 'Justification for type of information'
GLP compliance:
no
Type of method:
calculation method (fragments)
Partition coefficient type:
octanol-water

Test material

Constituent 1
Chemical structure
Reference substance name:
(2S,4R)-1-[(tert-butoxy)carbonyl]-4-hydroxypyrrolidine-2-carboxylic acid
EC Number:
604-011-7
Cas Number:
13726-69-7
Molecular formula:
C10H17NO5
IUPAC Name:
(2S,4R)-1-[(tert-butoxy)carbonyl]-4-hydroxypyrrolidine-2-carboxylic acid

Results and discussion

Partition coefficient
Key result
Type:
log Pow
Partition coefficient:
0.74
Remarks on result:
other: QSAR

Applicant's summary and conclusion