A. Aim of the module

The aim of the module “Filling Data Gap” is to give access to three different data gap filling tools:

  • Read-across

  • Trend analysis

  • (Q)SAR models

Read-across and trend analysis use the available experimental data in the data matrix to fill a data gap. “(Q)SAR models” gives access to a library of external (Q)SAR models which have been integrated into the Toolbox.

Depending on the situation, the most relevant data gap filling method should be chosen, taking into account the following considerations:


   • Read-across is the appropriate data-gap filling method for “qualitative” endpoints like skin sensitization or mutagenicity for which a limited number of results are possible (e.g. positive, negative, equivocal). Furthermore, read-across is recommended for “quantitative endpoints” (e.g., 96h-LC50 for fish) if only a low number of analogues with experimental results are identified.

   • Trend analysis is the appropriate data-gap filling method for “quantitative endpoints” (e.g., 96h-LC50 for fish) if a high number of analogues with experimental results are identified.

   • “(Q)SAR models” can be used to fill a data gap if no adequate analogues are found for a target chemical.
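The three considerations above can be read as a small decision rule. The sketch below is only an illustration of that reasoning: the function name and the `min_for_trend` cut-off are hypothetical, since the text speaks only of a "low" or "high" number of analogues and gives no figure.

```python
def choose_gap_filling_method(endpoint_is_qualitative, n_analogues, min_for_trend=10):
    """Suggest a data gap filling method from the considerations in the text.

    min_for_trend is a hypothetical cut-off between a "low" and a "high"
    number of analogues; the manual does not state a value.
    """
    if n_analogues == 0:
        # No adequate analogues: fall back to an integrated (Q)SAR model.
        return "(Q)SAR models"
    if endpoint_is_qualitative:
        # Qualitative endpoints (e.g. positive/negative) use read-across.
        return "Read-across"
    if n_analogues >= min_for_trend:
        # Quantitative endpoint with many analogues: fit a trend.
        return "Trend analysis"
    # Quantitative endpoint with only a few analogues.
    return "Read-across"


print(choose_gap_filling_method(True, 9))     # skin sensitization example
print(choose_gap_filling_method(False, 25))   # e.g. 96h-LC50 with many analogues
```

The final choice remains an expert judgement; the rule only mirrors the order in which the considerations are listed.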

When selecting read-across or trend analysis, the available data in the data matrix is used for filling a data gap. The user can further reduce the data set by using the profilers to eliminate chemicals which have different profiles compared to the target chemical.

Two situations can be distinguished:

  • If a specific mechanism or mode of action relevant for the endpoint is identified for the target chemical, then all the analogues considered should have the same mechanism or mode of action.

  • If no specific mechanism or mode of action relevant for the endpoint is identified for the target chemical, then none of the structural analogues considered should have specific mechanisms or modes of actions either.

Categorical data, such as data for skin sensitization or Ames mutagenicity endpoints, are predicted using read-across or (Q)SAR methods.

B. Data Gap Filling procedure

As explained in the Category definition section, the Toolbox has identified chemicals which have similar structural functionality to the target chemical and for which experimental results are available. The workflow for the example chemical 4-nitrobenzoyl chloride (CAS No 122-04-3) is illustrated below. Since skin sensitization is a “qualitative” endpoint, the data gap can be filled by read-across.

After highlighting the cell (1) in the matrix corresponding to the data gap to be filled, the user has to select the data gap filling methods (2) and then click on the Apply (3) button. (Figure 1)

Figure 1. Data Gap filling procedure

Before the data gap filling window opens, the Possible data inconsistency window appears (1) (Figure 2).

Figure 2

This feature alerts the user to possible data inconsistencies.

In the example illustrated by Figure 3, there are two fields, Assay (2) and Endpoint (3), placed below the Type of method (In vivo) field (1) (highlighted cell). Data from these two metadata fields (Assay and Endpoint) are mixed together in data gap filling.

Figure 3

The user can filter the data-points that enter the gap filling module. To speed up the work, the user can use the Select all/Unselect all button from the pop-up menu (Figure 4).

Figure 4

More detailed information about scales is presented in the About section. (See section Options/Unit/Edit scale definition)

Note: Only one scale/unit is allowed in data gap filling.

The number located on the bottom of the window (e.g. Selected 9/9) means that 9 data points from a total of 9 data points will enter the data gap filling module.

C. Data Gap Filling window

After clicking the OK button the Data Gap Filling module starts. The next three snapshots illustrate the different types of gap filling methods:

Read-across window:

Trend analysis window:

(Q)SAR models window:

D. Common features in three gap filling methods

The three gap filling approaches share the following common features:

  • Panels over the graph

  • Color legend

  • Menus of data gap filling

  • Right click functionality

  1. Panels over graph

    Descriptors panel

The “Descriptors” panel is shown in Figure 5 with its pop-up menu expanded. Here the user can select the descriptor to be used as the X axis of the graph. The Y axis is the value of the data-points, most commonly on a logarithmic scale. The units on the Y axis can be changed in Options, subsection Units. The Descriptors panel is available for the Read-across, Trend analysis and (Q)SAR models data gap filling approaches. To pick a descriptor for the X axis, the user has to select it (1) and then, from the pop-up menu (2), click on the “Make active descriptor” (3) item.

Figure 5

Three further options are available in the pop-up menu shown in Figure 5:

     • Collect data: if the user wants to use a custom descriptor which is not calculated previously (all available databases in Toolbox are previously indexed and have their 2D and 3D calculations cached), he/she should click on Collect data in order to use this custom descriptor

     • Change descriptor units: when this option is invoked the following window appears (Figure 6):

Figure 6

Then the user could change the dimension of the selected descriptor. Also different conversions are allowed.

     • Edit descriptor options: this pop-up window allows the user to select a different type of calculation for the selected descriptor. This setting was developed especially for tautomeric set prediction. (Figure 7)

Figure 7

Note: The user is allowed to set more than one descriptor on the X-axis for the purposes of the read-across method only.

   Prediction panel

The “Predictions” panel, shown in Figure 8, is the same for the three data gap filling approaches, with slight differences: for read-across it highlights the neighbor data-points with respect to the descriptor used on the X-axis (by default the 5 nearest, which can be changed in Prediction approach options); for trend analysis it displays the trend line. For all approaches this panel shows the distribution of data-points and the predicted value of the target chemical on a graph. At the bottom of the panel there is a dropdown list of descriptors (1) that can be used for the X axis, but this is for visualization purposes only; the prediction is recalculated only when the descriptor is changed in the Descriptors panel. By default the logKow descriptor is used (2).

Figure 8

    Adequacy panel

The Adequacy panel (Figure 9) houses the adequacy graph, which plots the observed value on one axis and the predicted value on the other. The coefficient of determination (R2) and the adjusted coefficient of determination (R2adj) (1) are displayed at the top of the graph.

Figure 9
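As an illustration of the two statistics shown on the adequacy graph, the following sketch computes R2 and R2adj from observed and predicted values using the standard textbook definitions. It is an assumption that the Toolbox uses exactly these formulas; the function names are hypothetical.

```python
def r_squared(observed, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot


def adjusted_r_squared(r2, n_points, n_descriptors):
    """R2 penalized for the number of descriptors in the equation."""
    return 1.0 - (1.0 - r2) * (n_points - 1) / (n_points - n_descriptors - 1)


observed = [1.0, 2.0, 3.0, 4.0]
predicted = [1.1, 1.9, 3.2, 3.8]
r2 = r_squared(observed, predicted)
print(round(r2, 3))                                              # 0.98
print(round(adjusted_r_squared(r2, len(observed), 1), 3))        # 0.97
```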

      Cumulative frequency panel

The cumulative frequency is the frequency with which the value of the residual (EP.obs – EP.calc) is less than or equal to a reference residual value.

For example, if the cumulative frequency is 70% at a residual value of 0.2, then 70% of the training set members have residuals less than or equal to 0.2. Below is a snapshot of the cumulative frequency graph. (Figure 10)
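The definition above can be sketched as follows. The function name is hypothetical, and the residual is taken as the signed difference EP.obs – EP.calc exactly as written in the text; whether the Toolbox uses the absolute value instead is not stated here.

```python
def cumulative_frequency(observed, calculated, reference_residual):
    """Percentage of residuals (obs - calc) that are <= reference_residual."""
    residuals = [o - c for o, c in zip(observed, calculated)]
    count = sum(1 for r in residuals if r <= reference_residual)
    return 100.0 * count / len(residuals)


observed = [1.0, 2.0, 3.0, 4.0, 5.0]
calculated = [0.9, 1.9, 2.5, 3.95, 4.0]
# Residuals are roughly 0.1, 0.1, 0.5, 0.05 and 1.0, so three of five are <= 0.2.
print(cumulative_frequency(observed, calculated, 0.2))  # 60.0
```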


     Statistic panel

This panel shows the statistical characteristics of the regression equation (Figure 11). The upper part of the panel includes statistics for the regression equation (1), while the lower part shows the coefficients included in the regression equation (2).

Figure 11
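To make the notion of regression coefficients concrete, the sketch below fits a straight line by ordinary least squares with a single descriptor. This is a generic illustration, not the Toolbox's own routine; the function name is hypothetical.

```python
def fit_linear_trend(x, y):
    """Ordinary least squares fit of y = intercept + slope * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Descriptor values (e.g. log Kow) and endpoint values of the analogues.
descriptor = [0.0, 1.0, 2.0, 3.0]
endpoint = [1.0, 3.0, 5.0, 7.0]
intercept, slope = fit_linear_trend(descriptor, endpoint)
print(intercept, slope)            # 1.0 2.0
print(intercept + slope * 4.0)     # value predicted for a target at x = 4
```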

    Residuals panel

This panel includes a graph showing the distribution of residuals for a given endpoint versus a specified descriptor. (Figure 12)

Figure 12

   2. Color Legend

The resulting graph plots the existing experimental results of all analogues (Y axis) according to a specified descriptor (X axis). The default descriptor is log Kow.

      • Read-across: The dark red dots (1) on the graph represent the experimental data available for the analogues which are used for the read-across. The blue dots (2) on the graph represent the experimental data available for the analogues which are not used for the read-across, as they are further away from the target chemical on the X-axis. By default the five nearest analogues with respect to logKow are used in the read-across calculation. (This is optional. See Calculation options/Prediction approach options). The red dot (3) represents the estimated result for the target chemical based on the read-across from the analogues. (Figure 13)

Figure 13

     • Trend analysis: The blue dots (1) on the graph represent the experimental data available for the chemicals in the category which are used in the regression equation. The red dot (2) represents the calculated data for the target chemical based on the regression calculation from the analogues in the category. The observed value for the target chemical, if available, is colored in orange (3). (Figure 14)

Figure 14

    • QSAR model – The blue dots (1) on the graph represent the experimental data for analogue chemicals, the little blue triangles (2) represent the observed data for training set chemicals of the model (if available), and the little blue squares (3) represent the observed values for test chemicals (if available). (Figure 15)

Figure 15

More details for color legend are displayed below. (Figure 16)

Figure 16

The chart legend can be invoked from the Information submenu in the menus portion of the gap-filling window by clicking on Show legend. (See section Information/Show legend)

   3. Menus of the data gap filling

The menus of the data gap filling are shown in Figure 17.

Figure 17 – menus of data gap filling

Below is a short description of each menu.

       3.1. Select/filter data (Figure 18)

Figure 18 – Select/filter data options

          • Subcategorize - Sub-categorization is one of the most powerful tools available to the user. It allows the user to refine the broader category into a more consistent, more pertinent set of chemicals from which to derive a prediction using the chemicals’ properties. (Figure 19)

Figure 19

In the particular example illustrated for CAS 122-04-3 (Figure 20), all results of the analogues are positive. The same sensitizing potential is therefore also predicted for the target chemical. By default, the Toolbox averages the results of the 5 “nearest” analogues with respect to log Kow (as defined by the X-axis descriptor) to estimate the result for the target chemical. The user can then verify the mechanistic robustness of the analogue approach.

Figure 20

This can be verified by opening the Select / Filter Data menu and re-profiling the list of identified analogues by clicking on Subcategorize (1) and choosing Protein Binding by OASIS (2). The properties of the analogues (3) are then compared with the properties of the target chemical (4). In this particular example there is one analogue which has a different protein binding mechanism than the target chemical. It is colored in green on the graph and in the data matrix (5). (Figure 21)

Figure 21

The user can eliminate analogues which have a mechanism of action dissimilar to that of the target chemical by clicking on the Remove button. The left part of the Subcategorization panel includes all profiling/categorizing methods which can be used for evaluating and refining the category. More detailed information on the endpoint and the related subcategorization method is given in the Categorization section, in the table Background information for some grouping methods. Along with the profiling methods, all observed metabolism and simulated transformation tables are included in the Subcategorization panel. In order to apply one of the available simulators (abiotic/biotic) the user has to select the desired profiling method (1) and click on the related simulator (2). (Figure 22)

Figure 22

For the prediction of skin sensitization the following simulators can be used: Skin sensitization and Autoxidation simulators.

The following simulators are related to prediction of genotoxicity: Rat liver metabolism simulator and Rat liver S9 metabolism simulator.

When a simulator is included in the profiling process the software starts to metabolize the chemicals in the category and applies the selected profile to the package of the target and its metabolites. This process could be time consuming as the Databases are not previously metabolized. The profiling results of a package, target and metabolites, are displayed on the Target panel of Subcategorization window (1). (Figure 23)

The user has to click on Do not account for metabolism in order to turn off metabolizing and profiling of metabolites. (2) (Figure 23)

Figure 23

Note: Keep in mind that none of the Databases and Inventories are metabolized in advance.

The right panel of Subcategorization window consists of two parts Target and Analogues. Both panels include profiling results of target and analogues. The Target panel includes profiling results (1) across selected profiling method (2) for the Target and its metabolites if available (3). (Figure 24)

Figure 24

The Analogues panel includes profiling results for analogue chemicals. Analogues having different profiling results than the target chemical are highlighted with a blue background (1). The number in brackets (2) indicates the number of analogues in the respective category (in the example above there are 9 analogues which have no alert across Protein binding). The user can eliminate dissimilar analogues by clicking on the Remove button (3). (Figure 25)

Figure 25

Right click over a category invokes the pop-up menu with some secondary functions (1) (Figure 26)

Figure 26

There are two options to differentiate analogues from the target chemical regarding profiling results (Figure 27):
   • Analogues having at least one profiling category different from those of the target (1)
   • Analogues having all categories simultaneously different from those of the target (2)

With the first option, the software marks an analogue as different if at least one of its categories (profiling results) differs from those of the target chemical. With the second option, the software marks an analogue as different only if all of its categories (profiling results) differ from those of the target simultaneously.

Figure 27
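The two options can be pictured with sets of profiling categories. The sketch below is one possible reading of the two rules, not the Toolbox's documented logic: an analogue category counts as "different" when it does not occur among the target's categories. The function names and the example category labels are illustrative.

```python
def differs_in_at_least_one(target_categories, analogue_categories):
    """Option 1: some category of the analogue is not found for the target."""
    target = set(target_categories)
    return any(category not in target for category in analogue_categories)


def differs_in_all(target_categories, analogue_categories):
    """Option 2: none of the analogue's categories is found for the target."""
    target = set(target_categories)
    return all(category not in target for category in analogue_categories)


target = {"Acylation", "SN2"}
partly_shared = {"Acylation", "Michael addition"}
nothing_shared = {"No alert found"}

print(differs_in_at_least_one(target, partly_shared))   # True
print(differs_in_all(target, partly_shared))            # False
print(differs_in_all(target, nothing_shared))           # True
```

Under this reading the second option is the stricter test, so it marks fewer analogues for removal than the first.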

Adjust options

The Adjust options button is used to change the settings of the available groupers, for instance Structure similarity. Structure similarity was developed for identifying chemicals (analogues) based on different levels of structural similarity between the target and analogue chemicals. When the user selects the Structure similarity (1) module, the software distributes all analogues into bins (2) of similarity (Figure 28). In this case study (CAS 122-04-3) all analogues in the identified category are distributed into three bins of similarity.

Figure 28

The user can double click on a bin (1) to see the list of chemicals included in it (2). More details on the similarity of chemicals falling into a given bin can be displayed by right-clicking on the chemical(s) and selecting the Explain menu item (3) (Figure 29).

Figure 29

A window displaying details on the similarity between two compared structures appears (Figure 30)

At the top of the window is the percentage of similarity between the structures (1). The molecular structures of the target and the compared chemical are shown next (2). The features (fragments) (3) used in determining the similarity between the two chemicals are displayed. Green (4) indicates a feature common to the two compared molecules, while red indicates a difference (5) (Figure 30).

Figure 30
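The idea of a similarity percentage built from common and differing features can be sketched with the Tanimoto coefficient, one widely used similarity measure. It is an assumption that this particular formula is selected; the text does not name the available methods, and the feature labels below are purely illustrative.

```python
def compare_features(features_a, features_b):
    """Return (similarity in percent, common features, differing features).

    Uses the Tanimoto coefficient: |A and B| / |A or B|.
    """
    a, b = set(features_a), set(features_b)
    common = a & b          # would be shown in green
    different = a ^ b       # would be shown in red
    union = a | b
    percent = 100.0 if not union else 100.0 * len(common) / len(union)
    return percent, common, different


target_features = {"C=O", "Cl", "NO2", "aromatic ring"}
analogue_features = {"C=O", "OH", "NO2", "aromatic ring"}
percent, common, different = compare_features(target_features, analogue_features)
print(percent)              # 60.0
print(sorted(different))    # ['Cl', 'OH']
```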

Settings of similarity

A list of settings for the similarity calculation is available by clicking on the Adjust options button. The areas of the Similarity options window are illustrated in Figure 31:

Figure 31

1 – Methods for calculation of similarity;
2 – Panel displaying formula for calculation of similarity;
3 – Graphical description of overlapping of features of similarity;
4 – Molecular features used in calculation;
5 – Options related to corresponding molecular feature;
6 – Short description of selected molecular feature;
7 – Calculation options;
8 – Atom characteristics used in calculation of similarity;
9 – Example illustrating the selected method and features of similarity

To compare two chemicals, the user has to double click on the chemical structure (1) and draw or paste the SMILES of the target structure (2) into the 2D Editor panel. (Figure 32)

Figure 32

This has to be done for the other query chemical too (1). Then the user has to select the desired calculation settings (2) (Figure 33)

Figure 33

        • Mark chemical by WS (Figure 34)

Figure 34

This function compares data values which are in the volume concentration unit family against the calculated maximum water solubility of the chemical in order to detect experimental errors. When this button is selected (1) the user has to select one of the available methods (2) and click OK (3). The chemicals are then marked (colored in green) (4). (Figure 35)

Figure 35
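A minimal sketch of such a check is given below. It assumes that a data-point is suspicious when its value exceeds the calculated maximum water solubility and that both are already expressed in the same unit; the function name and the example numbers are hypothetical.

```python
def mark_by_water_solubility(data_points, max_solubility):
    """Return the IDs of chemicals whose reported value exceeds their solubility.

    data_points: list of (chemical_id, value) pairs, e.g. in mg/L.
    max_solubility: dict mapping chemical_id to calculated solubility, same unit.
    """
    marked = []
    for chemical_id, value in data_points:
        limit = max_solubility.get(chemical_id)
        # Chemicals without a calculated solubility cannot be checked.
        if limit is not None and value > limit:
            marked.append(chemical_id)
    return marked


points = [("A", 50.0), ("B", 500.0), ("C", 10.0)]
solubility = {"A": 100.0, "B": 120.0}
print(mark_by_water_solubility(points, solubility))  # ['B']
```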

      • Mark chemicals by descriptor values (Figure 36)

Figure 36

This function is used to select chemicals that fall within a range of a particular parameter. The user has to click on Mark chemicals by descriptor value (1) select one of the available descriptors (2), define the desired range and then click OK (3). Finally chemical(s) that fall in the defined range are selected (colored in green) (4) (Figure 37)

Figure 37

     • Filter point by test conditions (Figure 38)

Figure 38

This feature allows the user to remove some of the data-points based on their metadata. After selecting the Filter by test condition button (1) a new Data Filter (2) window appears. Here the user can select the desired metadata field (3), use it for filtering and remove the dissimilar data (4). (Figure 39)

Note: this functionality removes experimental data for a given chemical.

Figure 39

        • Mark focused chemical (Figure 40)

Figure 40

The Mark focused chemical function (1) marks (in green) the data-points of the currently focused chemical (2). (Figure 41)

Figure 41

       • Mark focused points (Figure 42)

Figure 42

This functionality (1) marks (in green) the currently focused data point(s) (2). (Figure 43)

Figure 43

       • Remove marked chemicals/points (Figure 44)

Figure 44

This function (1) removes marked (colored in green) chemicals/data-points (2). (Figure 45)

Figure 45

       • Clear existing marks (Figure 46)

Figure 46

This function clears the markings of chemicals/data-points.

  3.2. Selection navigation (Figure 47)

Figure 47

This functionality applies the following actions:

• Go back – undo one change.

• Go forward – redo one change.

• Go to first – go to initial state.

• Go to last – go to final state.
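These four actions behave like a linear history of states with a movable cursor. The class below is a hypothetical sketch of that idea, not the Toolbox implementation; in particular, discarding the "redo" states when a new change is recorded is an assumption.

```python
class SelectionHistory:
    """Linear undo/redo history of selection states."""

    def __init__(self, initial_state):
        self._states = [initial_state]
        self._index = 0

    def record(self, state):
        # A new change drops any states that were ahead of the cursor.
        del self._states[self._index + 1:]
        self._states.append(state)
        self._index += 1

    @property
    def current(self):
        return self._states[self._index]

    def go_back(self):
        if self._index > 0:
            self._index -= 1
        return self.current

    def go_forward(self):
        if self._index < len(self._states) - 1:
            self._index += 1
        return self.current

    def go_to_first(self):
        self._index = 0
        return self.current

    def go_to_last(self):
        self._index = len(self._states) - 1
        return self.current


history = SelectionHistory(["A", "B", "C"])
history.record(["A", "B"])      # one analogue removed
history.record(["A"])           # another analogue removed
print(history.go_back())        # ['A', 'B']
print(history.go_to_first())    # ['A', 'B', 'C']
print(history.go_to_last())     # ['A']
```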

 3.3. Gap filling approaches (Figure 48)

Figure 48

This functionality allows the user to switch between gap filling modules.

• Read-across – switch to read-across.

• Trend analysis – switch to trend analysis.

3.4. Descriptors/Data (Figure 49)

Figure 49

This functionality allows the user to (in the descriptors panel):
  • Make active descriptor – make the selected descriptor active.
  • Remove active descriptor – deactivate a descriptor in use.
  • Collect data – collect data for the selected available descriptor.
  • Change descriptor units… - change the unit of the selected available descriptor.
  • Edit descriptor options… - change the calculation parameter used in gap filling approach. (Figure 50)

Figure 50

   3.5. Model/(Q)SAR (Figure 51)

Figure 51

This functionality allows the user to:

  • Save model – save the model. A form will be displayed in which the user should fill in all information pertinent to the model (Figure 52).

Figure 52

The user is allowed to fill in fields such as: Model name, other related model etc. Also fields in the other panels: General info, Endpoint, etc. can be filled in or edited (Figure 53):

Figure 53

  • Save domain as category – save the domain as a category. The user is prompted to select a profiler to which the category will be added. To use this feature, the user first needs to create a custom profiler to serve as storage for the categories. This profiler/grouping method can then be used for categorization purposes.

  • Save JRC XML QMRF – save model as a XML file.

  • Calculate Q2 – the Q2 parameter is not calculated automatically for categories containing more than 50 analogues, because the calculation is quite time consuming for such categories. This option allows the user to start the calculation manually.
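The cost of Q2 comes from refitting the model repeatedly. The sketch below assumes that Q2 is the usual leave-one-out cross-validated coefficient (the text does not define it) and uses a single-descriptor linear model; all names are hypothetical.

```python
def _fit_line(x, y):
    """Least squares fit of y = intercept + slope * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def q2_leave_one_out(x, y):
    """Q2 = 1 - PRESS / SS_tot, refitting the line once per left-out point."""
    n = len(x)
    mean_y = sum(y) / n
    press = 0.0
    for i in range(n):
        # Fit on all points except point i, then predict point i.
        intercept, slope = _fit_line(x[:i] + x[i + 1:], y[:i] + y[i + 1:])
        press += (y[i] - (intercept + slope * x[i])) ** 2
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1.0 - press / ss_tot


x = [0.0, 1.0, 2.0, 3.0, 4.0]
print(q2_leave_one_out(x, [1.0, 3.0, 5.0, 7.0, 9.0]))   # 1.0 for a perfect line
print(q2_leave_one_out(x, [1.2, 2.9, 5.1, 6.8, 9.3]))   # below 1.0 for noisy data
```

Because one model is fitted per analogue, the work grows with the size of the category, which is consistent with the calculation being left to the user for large categories.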

   3.6. Calculation option (Figure 54)

Figure 54

     • Data usage – this sets the way the Toolbox handles multiple data-points per single chemical. (Figure 55)

Figure 55

   • Prediction approach options – for read-across sets the way the prediction is approximated – minimal, maximal, average, median, lower median, higher median, mode, lowest mode and highest mode. For trend analysis it sets the approximation type – averaging, linear and quadratic. (Figure 56)

Figure 56

   • Use target data for prediction - this functionality allows the user to include observed data of target if available in the gap filling calculation

   • Set level of significance – set the confidence level and standard deviation.
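The read-across prediction approaches listed above can be illustrated with Python's standard `statistics` module. The mapping of "lower median" and "higher median" to `median_low` and `median_high`, and the handling of ties for "mode", are assumptions; the text does not specify how the Toolbox resolves them.

```python
import statistics

# Hypothetical mapping from the option names in the text to reducers.
REDUCERS = {
    "minimal": min,
    "maximal": max,
    "average": statistics.mean,
    "median": statistics.median,
    "lower median": statistics.median_low,
    "higher median": statistics.median_high,
    "mode": statistics.mode,  # first mode encountered when there are ties
    "lowest mode": lambda values: min(statistics.multimode(values)),
    "highest mode": lambda values: max(statistics.multimode(values)),
}


def read_across_estimate(neighbor_values, approach="average"):
    """Combine the values of the nearest analogues into one estimate."""
    return REDUCERS[approach](neighbor_values)


values = [1.0, 2.0, 2.0, 3.0, 3.0, 4.0]
for name in REDUCERS:
    print(name, read_across_estimate(values, name))
```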

  3.7. Visual options (Figure 57)

Figure 57

  • Set units in figure title – set the visualization options. (Figure 58)

Figure 58

   • Set axes ranges – allows manual setting of X and Y axis ranges. (Figure 59)

Figure 59

   • Show all members of chemical sets – this function was developed for sets of chemicals. Clicking this button will show/hide the members of chemical sets. All chemicals for a given set are visualized (e.g all tautomers in a tautomeric set) (Figure 60) (e.g light blue colored dots represent three tautomers for a given chemical).

Figure 60

• Show confidence range – show/hide the confidence range in the prediction panel. The inner range shows the confidence range of the regression equation (1), while the outer range shows the confidence range of an individual prediction (2). (Figure 61)

Figure 61

   • Show intercorrelations – shows the inter-correlations panel. When this button is clicked the user has to select the descriptors (1) for X and Y (2) axis used in the correlation in the Intercorr. window. (Figure 62)

Figure 62

   3.8. Information (Figure 63)

Figure 63

  • Focused details – show additional details about the selected data-point’s chemical:

   This window includes Chem ID information (1), the structure of the chemical (2), a panel with calculated descriptors (3), a panel with profiling results across selected descriptors (4) and a panel with recalculated data (5). (Figure 64)

Figure 64

Double clicking over the Endpoint obs. Data (recalculated) will display a window with experimental data (Figure 65)

Figure 65

  • Target details – show additional information about the target chemical.

  • Differences to target – this shows the differences between the selected data-point (1) and target chemical with respect to all available Profiling methods. The profiling method(s) for which there are some differences are colored in orange (2) (Figure 66)

Figure 66

• All points within a region – sometimes two or more chemicals in a category have the same logKow value, so that their dots lie one behind the other on the graph. This function allows the user to see all the chemicals (dots) within one region. The user has to click All points within a region (1) and then drag the mouse (left mouse button) to specify the rectangular region (2). As a result a window with details for all chemicals appears (3); selecting a specific point number (4) displays information for the selected chemical (5). (Figure 67)

Figure 67

   • Show legend – show/hide the chart’s legend.(1) (Figure 68)

Figure 68

   3.9. Miscellaneous (Figure 69)

Figure 69

  • Print chemicals – print the chemical list.

  • Save chemicals to SMI – save chemicals to SMI file. This is tab delimited file with CAS, Name and SMILES of structures

  • Copy picture – copy the prediction panel’s chart to clipboard.
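For orientation, a tab-delimited file with CAS, Name and SMILES columns, as described for Save chemicals to SMI, could be written as in the sketch below. The column order follows the description above; the absence of a header row, the file name and the function name are assumptions.

```python
import csv
import os
import tempfile


def save_chemicals_to_smi(path, chemicals):
    """Write (CAS, name, SMILES) tuples as tab-delimited lines."""
    with open(path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
        for cas, name, smiles in chemicals:
            writer.writerow([cas, name, smiles])


chemicals = [
    ("122-04-3", "4-nitrobenzoyl chloride", "O=C(Cl)c1ccc(cc1)[N+](=O)[O-]"),
    ("66-25-1", "hexanal", "CCCCCC=O"),
]
out_path = os.path.join(tempfile.mkdtemp(), "chemicals.smi")
save_chemicals_to_smi(out_path, chemicals)
with open(out_path, encoding="utf-8") as handle:
    print(handle.read())
```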

4. Right click functionality

Another common feature for the three gap filling approaches is right click menus. All menus for gap filling section mentioned above are available by right click (1) on the graph. (Figure 70)

Figure 70

E. Specific features in gap filling approaches

1. Read-across

As mentioned in the section Aim of the module, read-across is the appropriate data-gap filling method for “qualitative” endpoints like skin sensitization or mutagenicity, for which nominal/ordinal scales are used. Furthermore, read-across is recommended for “quantitative endpoints” (e.g., 96h-LC50 for fish) if only a low number of analogues with experimental results are identified. In the read-across method all common features mentioned in the section Common features are available, with the exception of the panels placed above the graph: only the Prediction panel and the Descriptors panel are available. (Figure 71)

Figure 71

In contrast to trend analysis, the read-across method can use more than one descriptor. The number of neighbors is 5 by default, but it can be changed in the calculation options. The distance between the target and each analogue is the Euclidean distance in the multidimensional space defined by the active descriptors, with each descriptor normalized by its range of values. For example, if read-across uses two descriptors (logKow and MW), the distances along the “logKow” and “MW” axes are:
LogKow distance (target, analogue) = (logKow (target) – logKow (analogue)) / LogKow range
MW distance (target, analogue) = (MW (target) – MW (analogue)) / MW range
The Euclidean distance is the square root of the sum of the squared distances along each axis.
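The distance calculation described above can be sketched as follows. The function names are hypothetical, and taking each descriptor's range over the analogues in the category is an assumption; the text does not say over which chemicals the range is computed.

```python
import math


def descriptor_ranges(chemicals, descriptors):
    """Range (max - min) of each descriptor over the given chemicals."""
    ranges = {}
    for name in descriptors:
        values = [chemical[name] for chemical in chemicals]
        span = max(values) - min(values)
        ranges[name] = span if span > 0 else 1.0  # guard against zero range
    return ranges


def normalized_distance(target, analogue, ranges):
    """Euclidean distance with each axis scaled by its descriptor range."""
    return math.sqrt(sum(
        ((target[name] - analogue[name]) / span) ** 2
        for name, span in ranges.items()
    ))


def nearest_analogues(target, analogues, descriptors, k=5):
    """Return the k analogues closest to the target (5 by default)."""
    ranges = descriptor_ranges(analogues, descriptors)
    ranked = sorted(analogues, key=lambda a: normalized_distance(target, a, ranges))
    return ranked[:k]


target = {"logKow": 2.0, "MW": 150.0}
analogues = [
    {"id": "A", "logKow": 1.0, "MW": 100.0},
    {"id": "B", "logKow": 2.1, "MW": 155.0},
    {"id": "C", "logKow": 5.0, "MW": 200.0},
]
nearest = nearest_analogues(target, analogues, ["logKow", "MW"], k=2)
print([a["id"] for a in nearest])  # ['B', 'A']
```

Scaling by the range keeps a descriptor with large numeric values, such as MW, from dominating one with small values, such as logKow.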

2. Trend analysis

As mentioned in the section Aim of the module, trend analysis is the appropriate data-gap filling method for “quantitative endpoints” (e.g., 96h-LC50 for fish) if a high number of analogues with experimental results are identified. Furthermore, read-across is recommended for “quantitative endpoints” (e.g., 96h-LC50 for fish) if only a low number of analogues with experimental results are identified. Here all common features described in the section Common features are available.

The following points are specific to trend analysis:
 • Mark outlier points
 • Usage of more than one active descriptor
 • Usage of “qualitative” data

Mark outlier points (Figure 72)

Figure 72

This function marks those chemicals which fall outside the confidence range (Figure 73).

Figure 73

Usage of more than one active descriptor: In trend analysis only one active descriptor is allowed.

Usage of “qualitative” data: The user is not allowed to switch to or enter trend analysis with “qualitative” endpoints. Only “quantitative” endpoints are allowed in trend analysis.

3. QSAR models

As mentioned in the Section Aim of the module “(Q)SAR models” gives access to a library of external (Q)SAR models which have been integrated into the Toolbox and can be used to evaluate the robustness of a category or to fill a data gap if no adequate analogues are found for a target chemical.

  3.1. Library with QSAR models

A library with all available QSAR models is listed when Show only relevant (1) and Show estimated DB (2) are unchecked. (Figure 74)

Figure 74

If these checkboxes are ticked only the models related to the specific nodes of endpoint tree are available. For example if the node Ecotoxicity>>Actinopergyii (fish) (1) is selected and the two aforementioned boxes are ticked (2) only the ECOSAR (USEPA) (3) model is available in the relevant QSAR models panel (4). The other models that are associated with more specific endpoint nodes are positioned in the panel QSAR models in nodes below (5). (Figure 75)

Figure 75

The models placed in the panel QSAR models in nodes below become available when the user selects the node to which the model is assigned. For instance, when the user expands the Actinopergyii (fish) node and then selects Pimephales promelas (1), the models related to this fish become available (2). (Figure 76)

Figure 76

3.2. Ranking of QSAR models

Results of models related to a given endpoint can be compared using the ranking functionality. When the user clicks the Rank models button (1), the Models ranking window appears (2). (Figure 77)

Figure 77

Detailed information for Models ranking window is given on Figure 78 below:

Figure 78

The visibility and ranking of fields in the table can be managed using the pop-up menu’s Select Descriptor menu item (1); the QSAR descriptors window then appears (1) (Figure 79).

Figure 79

In the displayed QSAR descriptors panel the user can change the visibility of a field by double clicking on the cell of that field and changing it to YES or NO (1). For example, the Title field is visible or hidden depending on whether YES or NO is set in the corresponding cell. (Figure 80)

Figure 80

Fields can be reordered (Up or Down), and a specific model field can be used for ranking, by double clicking and changing the current status of the field (2) (Figure 80).

When the ranking settings are fixed, the models are ranked in the Relevant QSAR library window, for example by Title (1) or by availability in domain (2) (Figure 81).

Figure 81

  3.3. Applying QSAR model

The model can be used to evaluate a category or a single chemical by applying it to all the chemicals in the category and analyzing the results. To apply the model simultaneously to all the chemicals:

   • in the category, select the model (1), right-click on it and select “Predict Endpoint” (2) and “All chemicals” (3). (Figure 82)

   • in the domain of the model, select the model (1), right-click on it and select “Predict Endpoint” (2) and “All chemicals in domain” (3). (Figure 83)

   Predict All chemicals

Figure 82

   Predict Chemicals in domain

Figure 83

Right click over selected models provides the user with some additional functions (1) related to the selected QSAR model (2) (Figure 84)

Figure 84

Pop-up menus

   • Rank models – this functionality is displayed in Ranking models section

   • Sort by Date – QSAR models are sorted by date when this is set

   • Model About – some information for the QSAR model is provided (Figure 85)

Figure 85

• Model Options – this function is available for QSAR models whose descriptors can be calculated in different ways. For example, if the user selects the BCF (EPISUIT) (1) model and then Calculation options (2), a pop-up menu appears where the user can choose the way of calculating the BCF parameter (3) used further in the calculation of the BCF (EPISUIT) model. (Figure 86)

Figure 86

• Display Domain – this function visualizes the Domain (if available) for the selected QSAR model (Figure 87)

Figure 87

In this particular example the target chemical (3) belongs to the domain (2) of model M2-LC50-Pimephales promelas (1), because it fulfills the conditions (green ticked boundaries) of the model (4).

  • Display tautomeric filter – functionality available when tautomeric filter is applied on a set of chemicals

  • Apply tautomers filter – functionality available when a QSAR model is derived for a tautomeric set of chemicals

  • Display QMRF – it displays the QMRF file for the selected model (if available). (Figure 88)

Figure 88

• Display training set chemicals – it displays a separate window with the chemicals in the training set of the selected model (if available) with their available experimental data. (Figure 89)

Figure 89

• Display Test set chemical – it displays test set chemicals for a selected QSAR model (if available). (Figure 90)

Figure 90

  • Delete Model – it deletes a custom QSAR model. Only custom QSAR models can be deleted

  • Delete Predictions – it deletes the predictions for the selected QSAR model

  • Check Calculations – it displays a window with a table comparing the Regression Statistics and Regression Equations of the original model and the recalculated model. (Figure 91)

Figure 91

  • Rebuild – it rebuilds the selected QSAR model. (Figure 92)

Figure 92

  3.4. Creating a new QSAR model

The Toolbox allows the user to create a new custom QSAR model. The next sequence of snapshots demonstrates building a QSAR model for predicting the acute toxicity of aldehydes to Tetrahymena pyriformis. For this purpose, a category of analogues should be available. In our case study the target chemical has CAS 66-25-1, the category used to define the new QSAR is Aldehydes by US-EPA, and the investigated endpoint is IGC 50 48 h, Tetrahymena pyriformis (1). The user has to click the Create New QSAR button (2) and finally click the Apply button (3). (Figure 93)

Figure 93

After the gap filling module is displayed, the descriptor on the X-axis has to be activated in order to build the regression (1). Log Kow is selected but not active, so the user has to activate it manually (2). (Figure 94)

Figure 94

After activating the chemical descriptor used in the equation the user has to build the model (1). (Figure 95)

Figure 95

After the Build button (1) is clicked, the model is built and all analogues (dots) are colored in purple (2). (Figure 96)

Figure 96
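To make the Build step more concrete, the sketch below fits a straight line between a single descriptor (log Kow) and an endpoint value by ordinary least squares. It only illustrates what building a one-descriptor regression means. The function name `fit_line` and the data points are hypothetical, and the Toolbox's own fitting procedure and statistics may differ.

```python
def fit_line(x, y):
    """Ordinary least-squares fit of y = intercept + slope * x."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need at least two (x, y) pairs of equal length")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    if sxx == 0:
        raise ValueError("all descriptor values are identical")
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # Coefficient of determination of the fitted line.
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    ss_res = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    r2 = 1.0 - ss_res / ss_tot if ss_tot > 0 else 1.0
    return slope, intercept, r2


# Hypothetical analogue data, for illustration only (not Toolbox values).
log_kow = [0.5, 1.2, 1.8, 2.4, 3.1]
toxicity = [-0.9, -0.4, 0.1, 0.5, 1.2]

slope, intercept, r2 = fit_line(log_kow, toxicity)
print(f"toxicity = {intercept:.3f} + {slope:.3f} * log Kow (R^2 = {r2:.3f})")
```

Removing analogues by subcategorization changes the input lists, so the line has to be refitted afterwards.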

Here the user is allowed to apply subcategorizations in order to refine the category.

Below are additional options available for QSAR modes ONLY:

   • Mark chemicals out of domain

   • Show analogues

  • Show training set

  • Show test set

  • Build model

  • Restore model

Select/Filter data

Mark chemicals out of domain – it marks those chemicals which are out of the domain of the model

Model QSAR

  Show analogues – shows analogue chemicals included in the model

  Show training set – shows chemicals included in the training set of the model if available

  Show test set – shows chemicals in the test set of the model if available

  Build model – builds the model

  Restore model – restores the model

3.5. Application of QSAR model to defined category of chemicals

The Toolbox allows the user to apply a selected QSAR model to the chemicals present in the data matrix. First, the user has to click on the cell corresponding to the QSAR model (1), select the QSAR model (2) and click the Apply button (3). (Figure 97)

Figure 97

In this particular case study there are 68 analogues with 68 experimental IGC 50 values, so 68 analogues will be included in the category for gap filling. After the Apply button is clicked, the Possible data inconsistency window appears. Note that it checks the data against the scale of the applied model. In this case study the M2 model requires log (mol/l) (1); if a conversion of the experimental data from mg/l (2) to log (mol/l) is available, all 68 data points will be allowed into the gap filling. (Figure 98)

Figure 98
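The scale conversion mentioned above, from mg/l to log (mol/l), needs only the molecular weight of the chemical. The sketch below shows the arithmetic. The function name is hypothetical, and the sign convention is an assumption: some models use log (1/C) instead of log C, in which case the result changes sign.

```python
import math


def mg_per_l_to_log_mol_per_l(conc_mg_l, mol_weight_g_mol):
    """Convert a concentration in mg/l to log10 of the concentration in mol/l."""
    if conc_mg_l <= 0 or mol_weight_g_mol <= 0:
        raise ValueError("concentration and molecular weight must be positive")
    grams_per_l = conc_mg_l / 1000.0            # mg/l -> g/l
    mol_per_l = grams_per_l / mol_weight_g_mol  # g/l -> mol/l
    return math.log10(mol_per_l)


# Hypothetical example: 100 mg/l of a chemical with MW 100 g/mol is
# 0.001 mol/l, i.e. a value of about -3 on the log (mol/l) scale.
print(round(mg_per_l_to_log_mol_per_l(100.0, 100.0), 6))
```

Experimental values for which no such conversion is available cannot be put on the model scale.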

Now the QSAR model is applied to the 68 chemicals from data matrix. (Figure 99)

Figure 99

Now the user has to build the regression equation (1). (Figure 100)

Figure 100

Now the user is allowed to refine the category, applying the subcategorization procedure.

F. Data Gap Filling approaches using different modes of handling chemical structures

Two different modes for handling of chemical sets are defined:

  • Individual Component Mode - The target chemical, its metabolites or mixture constituents are analyzed as individual structures

  • Set mode - The chemical and its tautomers are handled as a set of structures. (Figure 101)

Figure 101

Three methodologies for estimating toxicity of set of chemicals are developed:

  • Independent mode (Dissimilar action)

  • Similar mode (Dose/concentration addition)

  • Specific models

Both concepts (independent action and dose/concentration addition) are based on the assumption that chemicals in a mixture do not influence each other’s toxicity, i.e. they do not interact with each other at the biological target site. Such chemicals can either elicit similar responses by a common or similar mode of action, or they act independently and may have different endpoints and/or different target organs. Both concepts have been suggested as default approaches in regulatory risk assessment of chemical mixtures.

Independent action (response addition, effects addition) occurs if chemicals act independently from each other, usually through different modes of action that do not influence each other.
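As an illustration of independent action, the sketch below uses the common textbook form of response addition, in which the combined fractional effect is one minus the product of the individual "no effect" probabilities. This manual does not state the formula, so treating it as the Toolbox calculation is an assumption, and the function name and effect values are hypothetical.

```python
import math


def independent_action(effects):
    """Combined fractional effect of independently acting components.

    Each entry is the fractional effect (0..1) that one component would
    cause on its own at its concentration in the mixture.
    """
    if not effects:
        raise ValueError("at least one component effect is required")
    if any(e < 0.0 or e > 1.0 for e in effects):
        raise ValueError("effects must be fractions between 0 and 1")
    # Probability that none of the independent components causes the effect.
    no_effect = math.prod(1.0 - e for e in effects)
    return 1.0 - no_effect


# Hypothetical example: components causing 10 % and 20 % effect alone
# give 1 - 0.9 * 0.8, i.e. about 0.28 combined.
print(round(independent_action([0.1, 0.2]), 6))
```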

Dose/concentration addition (similar action, similar joint action) occurs if chemicals in a mixture act by the same mechanism/mode of action, and differ only in their potencies. In principle, doses or concentrations of the single components are added after being multiplied by a scaling factor that accounts for differences in the potency of the individual substances. The mixture dose/concentration (Dmix) is the sum of the adjusted doses/concentrations (aDi) of the individual components Di:

Dmix = aD1 + aD2 + ... + aDn

The effect of a mixture of similarly acting compounds is equivalent to the effects of the sum of the potency-corrected (adjusted) doses/concentrations of each compound.
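The dose/concentration addition idea can be sketched in two equivalent ways. The first scales each dose by its potency relative to an index component and sums the adjusted doses. The second is the usual closed form for the effect concentration of the mixture. Both functions are illustrative assumptions: the scaling factor is taken here as the ratio of LC50 values, which the text does not specify, and all numbers are hypothetical.

```python
def adjusted_mixture_dose(doses, lc50s, index=0):
    """Sum of potency-adjusted doses, in equivalents of the index component.

    The scaling factor of component i is assumed to be LC50_index / LC50_i.
    """
    if not doses or len(doses) != len(lc50s):
        raise ValueError("doses and lc50s must be non-empty and equally long")
    if any(v <= 0 for v in lc50s):
        raise ValueError("LC50 values must be positive")
    reference = lc50s[index]
    return sum(d * reference / lc50 for d, lc50 in zip(doses, lc50s))


def mixture_lc50(fractions, lc50s):
    """LC50 of the mixture under concentration addition.

    fractions are the shares of each component in the mixture, expressed
    on the same basis as the LC50 values (mass- or mole-based).
    """
    if not fractions or len(fractions) != len(lc50s):
        raise ValueError("fractions and lc50s must be non-empty and equally long")
    if abs(sum(fractions) - 1.0) > 1e-9:
        raise ValueError("fractions must sum to 1")
    if any(v <= 0 for v in lc50s):
        raise ValueError("LC50 values must be positive")
    return 1.0 / sum(p / lc50 for p, lc50 in zip(fractions, lc50s))


# Hypothetical two-component mixture (LC50 of 10 and 40, same units).
print(adjusted_mixture_dose([2.0, 8.0], [10.0, 40.0]))  # 4.0 index equivalents
print(round(mixture_lc50([0.5, 0.5], [10.0, 40.0]), 6))  # 16.0
```

In the example, 8 units of the four-times-less-potent component count as 2 units of the index component, which is the potency correction described above.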

Specific models – This methodology aims to use QSAR models developed on the basis of sets of chemicals (mixtures) for mixture toxicity prediction. This section is under development.

Based on these methodologies for handling sets of chemicals, different ways of handling tautomers, mixtures and metabolites have been developed.

Note: If gap filling is entered with a set of chemicals whose component quantities are undefined, equimolar quantities are assumed for all components in the gap filling calculations.
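The equimolar default in the note above can be expressed as a small helper that returns the mole fraction of each component. This is a sketch only: the function name is hypothetical, quantities are assumed to be given in moles, and rejecting a partially defined set is an assumption, since the text covers only the fully undefined case.

```python
def mole_fractions(quantities):
    """Return mole fractions; use equimolar shares if no quantity is defined.

    quantities: one entry per component, a positive amount in moles or
    None when the quantity is undefined.
    """
    if not quantities:
        raise ValueError("the set has no components")
    if all(q is None for q in quantities):
        # Undefined quantities: every component gets the same share.
        return [1.0 / len(quantities)] * len(quantities)
    if any(q is None for q in quantities):
        # Not covered by the text; rejected here as an assumption.
        raise ValueError("quantities are only partially defined")
    if any(q <= 0 for q in quantities):
        raise ValueError("quantities must be positive")
    total = sum(quantities)
    return [q / total for q in quantities]


print(mole_fractions([None, None, None]))  # three equal shares
print(mole_fractions([1.0, 3.0]))          # [0.25, 0.75]
```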

1. Tautomeric set prediction

In comparison to TB 2.3, where tautomers are not handled, TB 3.0 handles tautomers as part of the structure multiplication of the parent chemical, and all tautomers of the target chemical are analyzed in a single package (set mode).

The procedure of data gap filling using a tautomeric set is illustrated below for the prediction of:

  1.1. Skin sensitization endpoint for chemical with CAS 577-71-9

    1.1.2. Enter the chemical via CAS (1) and select tautomeric set chemical (2). The software then searches the selected databases for the tautomeric set of the entered chemical. (Databases are already tautomerized, calculated and profiled in tautomeric mode.) Click the OK button. (Figure 102)

Figure 102

   1.1.3. Profile chemical set applying Protein binding OASIS and OECD scheme (1). The profiling results are displayed for the set of tautomers. (Figure 103)

Figure 103

     1.1.4. Gather data from Skin sensitization database as it is the skin sensitization endpoint that is being investigated. (Figure 104)

Figure 104

   1.1.5. Define category (2) of similar analogues using the Protein binding by OASIS scheme (1). As can be seen in the snapshot below, all profiling results for all chemicals in the set are listed, so the software will search the selected Skin sensitization database for chemical sets which correspond to the profiling results of the target set. Click OK (3). (Figure 105)

Figure 105

The software identifies 4 tautomeric sets which meet the criteria of the target set (1). (Figure 106)

Figure 106

   1.1.6. Read-across is applied (Figure 107)

Figure 107

The tautomeric sets of analogs have the same distribution of Protein binding alerts as the target set (1), so the prediction could be accepted (2). (Figure 108)

Figure 108

1.2. Aquatic toxicity (LC 50, 96h, Pimephales promelas) endpoint for chemical with CAS 89-62-3

   1.2.1. Input CAS 89-62-3 (1), select tautomeric set (2), click OK (3). (Figure 109)

Figure 109

  1.2.2. Select Aquatic OASIS database (1) and gather experimental data (2). (Figure 110)

Figure 110

  1.2.3. Define the category by US-EPA (1). There are two categories in the Target profiles panel: Aldehydes and Not categorized. The Not categorized category is removed (2), so the software will search for tautomeric sets belonging to the Aldehydes category by US-EPA. (Figure 111)

Figure 111

  1.2.4. Apply trend analysis to the defined category for LC 50 96h, Pimephales promelas (1). (Figure 112)

Figure 112

  1.2.5. There is a new functionality in TB for visualization of all tautomers in a tautomeric set. Open Visual options (1) and select Show all members of chemical sets (2). (Figure 113)

Figure 113

All members of the tautomeric set appear (1). (Figure 114)

Figure 114

1.2.6. The following subcategorization procedure is applied to refine the category

   • Remove chemicals having LC 50 more than their WS using WS fragments

   • US-EPA New Chemical categories

  • Chemical elements

  • ECOSAR

The next snapshot illustrates the prediction of LC 50 after the subcategorization procedure: 56.3 mg/l (1). (Figure 115)

Figure 115
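The first subcategorization step in 1.2.6, removing chemicals whose LC 50 exceeds their water solubility (WS), amounts to a simple filter over the analogues. The sketch below assumes both values are already available in the same unit (mg/l). The field names and values are hypothetical, and the way the Toolbox derives WS from fragments is not reproduced here.

```python
def within_solubility(chemicals):
    """Keep only chemicals whose LC50 does not exceed their water solubility."""
    return [c for c in chemicals if c["lc50_mg_l"] <= c["ws_mg_l"]]


# Hypothetical analogues, for illustration only.
analogues = [
    {"name": "analogue 1", "lc50_mg_l": 12.0, "ws_mg_l": 500.0},
    {"name": "analogue 2", "lc50_mg_l": 80.0, "ws_mg_l": 15.0},  # LC50 > WS
    {"name": "analogue 3", "lc50_mg_l": 3.5, "ws_mg_l": 3.5},
]

kept = within_solubility(analogues)
print([c["name"] for c in kept])  # ['analogue 1', 'analogue 3']
```

An LC 50 above the solubility limit cannot correspond to a truly dissolved concentration, which is why such analogues are excluded from the trend.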

In TB 3.0 a new functionality, Apply as filter, has been developed (1). (Figure 116)

Figure 116

Apply as filter – this option is developed especially for sets of chemicals (mixtures, tautomers). It is used to filter out chemicals from the chemical set (tautomers, mixture) with respect to the selected subcategorization profiling scheme. For example, if US-EPA is selected and Apply as filter is checked, the selected subcategorization profile is applied to each of the tautomers within the tautomeric set. If this option is not checked, the selected subcategorization is applied to the tautomeric set as a unit. This is illustrated in the snapshots below. (Figure 117)

Figure 117

As can be seen, the number of categories (1) in the analogue sets is different when the Apply as filter option (2) is used.
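The difference between the two behaviours can be sketched as follows. With Apply as filter, individual members are removed from each set. Without it, a set is kept or removed as a whole. The data structure, the category labels and the rule that a set is removed when any of its members falls into an unwanted category are assumptions made for illustration, not the documented Toolbox logic.

```python
def remove_members(sets, unwanted):
    """Apply as filter checked: drop matching members inside each set."""
    filtered = {}
    for name, members in sets.items():
        kept = [m for m in members if m["category"] not in unwanted]
        if kept:  # a set left without members disappears (assumption)
            filtered[name] = kept
    return filtered


def remove_sets(sets, unwanted):
    """Apply as filter unchecked: treat each set as a unit."""
    return {
        name: members
        for name, members in sets.items()
        if not any(m["category"] in unwanted for m in members)
    }


# Hypothetical tautomeric sets with one profiling category per member.
tautomeric_sets = {
    "set A": [
        {"id": "A1", "category": "Aldehydes"},
        {"id": "A2", "category": "Not categorized"},
    ],
    "set B": [{"id": "B1", "category": "Aldehydes"}],
}
unwanted = {"Not categorized"}

print(remove_members(tautomeric_sets, unwanted))  # set A keeps A1 only
print(remove_sets(tautomeric_sets, unwanted))     # set A is removed entirely
```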

  1.2.7. The final subcategorization scheme used in subcategorization procedure is Tautomers unstable.

The Unstable tautomer profiler is developed on the basis of available measured data and theoretical calculations for tautomeric forms in water and gas phase for the training set chemicals. The unstable tautomeric forms are presented as individual categories. The aim of this profiler is to identify unstable tautomeric forms within each tautomeric set.

The user has to select Tautomers unstable (1) and check Apply as filter (2) in order to filter tautomers within tautomeric sets. The user then has to select the unstable categories (3) while holding the Ctrl button and, finally, click the Remove button (4). The chemicals having unstable tautomeric forms will be removed from each tautomeric set included in the category. (Figure 118)

Figure 118

  1.2.8. Finally accept the prediction (1). (Figure 119)

Figure 119


  1. For toxic effects conditioned by cell signaling networks (such as skin sensitization, genetic toxicity, etc.) highly reactive tautomers appear to be responsible for the observed toxicity.

  Recommendation: use the complete tautomeric representations of the chemicals

  2. For toxic effects conditioned by less specific interactions (such as mortality, growth inhibition, immobilization, etc.) the stable tautomeric forms appear to be the dominant toxicants. Databases and Inventories usually contain the most stable tautomeric form.

  Recommendation: use the most stable tautomers for representation of the chemicals

2. Quantitative mixtures toxicity prediction

Defined mixtures are handled as part of the structure multiplication of parent chemical. Three new options for prediction of mixtures are available based on the mode of action of the constituents:

  • Acting Independently (with different mode of action)

  • Acting Similarly (with same mode of action)

  • Acting Specifically (specific models for predicting toxicity of mixtures could be applied)

The procedure for assessing mixture toxicity is illustrated below for the following two endpoints:

  • Aquatic toxicity

  • Skin sensitization

The following mixture with defined quantities is used in predicting the two endpoints mentioned above. (Figure 120)

Figure 120

  2.1. Predicting aquatic toxicity of mixtures:

    2.1.1. Input of mixture. TB 3.0 includes a feature for defining the quantities of the components of a mixture. Once the constituents of the mixture are pasted or drawn (1) in the 2D editor, a specific button allowing input of quantities appears (2). The quantities (3) of each of the components, along with their units (4), are added manually. (Figure 121)

Figure 121

After defining the quantities they will appear in the panel with molecular structure.

   2.1.2. Profiling components of the mixture – the user has to switch to Individual mode (1) and select the relevant profilers (2). In our case study we select ECOSAR, MOA and US-EPA, and then profile the mixture. As can be seen, all components have the same mode of action (3). (Figure 122)

Figure 122

  2.1.3. Gathering experimental data – in this particular case, related to aquatic toxicity, the user has to select the Aquatic databases (1) and gather experimental data (2). The available experimental data appear on the data matrix (3). (Figure 123)

Figure 123

  2.1.4. Gap filling approach using Similar mode of action. In this particular case study Similar mode is applied for the calculation because the investigated mixture has defined quantities and all components have the same mode of action. The user has to click on the cell related to LC 50, 96h, Pimephales promelas (1) for the mixture, select Similar mode (2) and click the Apply button (3). (Figure 124)

Figure 124

The prediction result (1) accounts for the quantities of each component and uses the dose/concentration addition calculation (2) for the prediction of LC 50. (Figure 125)

Figure 125

2.2. Predicting Skin sensitization

For two of the constituents there are experimental data for skin sensitization (SS). For the third mixture constituent there are no experimental SS data, so read-across will be applied. Then all of the available data, experimental and predicted, will be used for the SS prediction of the mixture. (Figure 126)

Figure 126

The procedure for the read-across prediction for one of the components is illustrated below:

   2.2.1. Focus on constituent without experimental data (1). It then appears in a new data matrix (2). (Figure 127)

Figure 127

   2.2.2. Define category by Protein binding by OASIS (1) – analogues with the same protein binding mechanism are selected (2). (Figure 128)

Figure 128

  2.2.3. Read-across for skin sensitization endpoint. We are selecting the cell corresponding to In vivo (1), mixing all endpoints and assays (2). (Figure 129)

Figure 129

Almost all analogs have been found to be positive. Predicted SS effect of the target is positive (1). (Figure 130)

Figure 130

Based on the prediction for the constituent (1) without experimental data and two other constituents with experimental data (2) the read-across prediction for mixture could be performed. (Figure 131)

Figure 131

  2.2.4. Read across is applied for the mixture (assuming Independent Mode of Action) (1). “Maximal” approximation type (2) is set by default for read across of categorical endpoints. (Figure 132)

Figure 132
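The "Maximal" approximation for a categorical endpoint can be read as taking the highest category found among the components on an ordered scale. The sketch below assumes that reading and a two-level scale in which Positive ranks above Negative. Both the scale and the function name are hypothetical.

```python
def maximal(outcomes, scale=("Negative", "Positive")):
    """Return the highest-ranked outcome on an ordered categorical scale.

    scale lists the categories from lowest to highest (assumed ordering).
    """
    if not outcomes:
        raise ValueError("at least one outcome is required")
    rank = {category: position for position, category in enumerate(scale)}
    unknown = [o for o in outcomes if o not in rank]
    if unknown:
        raise ValueError(f"outcomes not on the scale: {unknown}")
    return max(outcomes, key=rank.__getitem__)


# Two experimental results and one read-across prediction (hypothetical).
constituents = ["Positive", "Negative", "Positive"]
print(maximal(constituents))  # Positive
```

Under this reading a single positive constituent is enough for a positive mixture prediction, which is the conservative choice for a hazard endpoint.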

Note: TB 3.0 uses mixtures with defined quantities. If the quantities of the components are not defined, they are considered equimolar.

3. Prediction accounting for metabolism

In TB 2.3 metabolism could be used in Profiling and Subcategorization only, while in TB 3.0 generated metabolites can be used as representatives of the target chemical. Data Gap Filling can be applied to selected metabolites and predictions transferred to parent chemical.

The metabolites, along with the target, can also be treated as a set of chemicals, and predictions can be applied to the set.

Below is an illustration of a read-across prediction for skin sensitization applied to a selected metabolite for chemical “trans-2,cis-6-nonadienol” (CAS# 28069-72-9).

In the scheme below there are no alerts for the parent chemical, so an investigation of the metabolites of the target chemical can be performed:

Our target chemical has no protein binding alert; however, it has six metabolites with an alerting group responsible for protein interaction. The gap filling procedure is applied to a selected metabolite and the prediction is transferred to the parent chemical as follows:

   3.1. Entering target (CAS# 28069-72-9) chemical by CAS number. (Figure 133)

Figure 133

  3.2. Multiplication of the target chemical via the skin metabolism simulator – the user has to right-click on the chemical (1) and select Multiplication>> Metabolism/Transformations>>Skin metabolism simulator (2). (Figure 134)

Figure 134

Then all the metabolites appear in a tree-like form. (Figure 135)

Figure 135

  3.3. In the Profiling step the user can apply the Protein binding profilers, which are relevant to the skin sensitization endpoint, to all metabolites as a package or in the individual mode. (Figure 136)

Figure 136

3.4. The next step in the workflow is to gather experimental data. Skin sensitization is selected (1); click Gather (2). As can be seen, there is a positive experimental result for the parent chemical. (Figure 137)

Figure 137

However, no alert is found for the parent chemical. This leads us to investigate one of the metabolites for which an active alert is found. So in the next step we investigate metabolite # 1.

3.5. In order to investigate the specific metabolite, the user has to focus on metabolite # 1. This can be done by right-clicking on metabolite # 1 (1) and clicking the Focus menu item (2). (Figure 138) The “Focus” functionality allows the selected metabolite to be used as a representative of the target chemical.

Figure 138

Then the focused chemical is opened in a new data matrix (1) and will be used for further read-across analysis.

3.6. Analogue search for similar analogues uses USEPA new Chemical categories scheme (1). The software identifies 57 Analogues (2). (Figure 139)

Figure 139

3.7. Apply read-across (1) method for predicting skin sensitization, mixing all assays and endpoints (2). (Figure 140)

Figure 140

3.8. The following subcategorization procedure is applied:

     o Protein binding by OASIS

     o Protein binding by OECD

     o Protein binding potency

Below is the read-across analysis after subcategorization procedure (1). (Figure 141)

Figure 141

3.9. The user has to accept prediction (1) and return to data matrix (2) in order to continue with transferring the prediction of metabolite to the target chemical. (Figure 142)

Figure 142

Before returning to datamatrix a series of messages appear.

Note: The appearance of these messages is optional and is governed by the General options/Reports section.

The first message informs the user that the model is still not saved (1) and invites the user to save the model. If the Yes button is clicked then an Edit model window appears and invites the user to fill in the fields (2). If the No button is clicked then the software will not save the model. (Figure 143)

Figure 143

The next message asks the user to specify the profilers relevant to the investigated endpoint. These selected profilers will appear in the report. If the Yes button is clicked the window with all profilers appears where the user can select the desired profilers (2). By default there are some profilers selected (this is optional and default options could be changed in General Options/Report). If the No button is clicked then the profilers selected by default will appear in the report. (Figure 144)

Figure 144

The next message (1) asks the user whether he/she wants to collect additional data for the analogues from the data matrix for reporting purposes. If the Yes button (2) is clicked, a window with the Endpoint tree nodes appears and the user can specify the node for which the experimental data will be reported. If the No button is clicked, the default experimental data sets will be reported (this is optional and can be changed in Options/Reports). (Figure 145)

Figure 145

3.10. Finally click Return to datamatrix.

3.11. In order to transfer the prediction of metabolite to the parent chemical the user has to return to the parent chemical matrix. Return to Input (1) and click over the first node in the current document with parent chemical (2). Now the datamatrix of the parent chemical is displayed. (Figure 146)

Figure 146

3.12. Go to Data Gap Filling, select Independent mode (1) and click Apply (2). (This mode is allowed for “quantitative” and “qualitative” endpoints, while the Similar mode applies to “quantitative” endpoints only.) (Figure 147)

Figure 147

3.13. Accept the prediction. Now the prediction of metabolite is transferred to the parent chemical (1). (Figure 148)

Figure 148

G. Right click menus

Right-clicking a cell with accepted values provides the user with several options. (Figure 149)

Figure 149

Copy – copies the text from the highlighted cell

Explain – explains information in the highlighted cell formatted in table (same action is achieved with a double click over the cell). (Figure 150)

Figure 150

Delete prediction – deletes the accepted prediction

Display prediction domain – displays the domain of the prediction

Explain prediction – prediction explanation (2) obtained from BfR skin irritation/corrosion QSAR (3) model in table form (3). (Figure 151)

Figure 151

Edit prediction info – allows the user to fill in or edit the fields which appear in the Toolbox report

Report – generates a report for the selected prediction if multiple predictions are available.

IUCLID5 – export prediction via i5z files or via Web Services (see IUCLID export)