
Demo-PY4: Predictive maintenance with scikit-learn: A decision tree for the classification of failures for an automobile data set

Demo-PY4: Predictive maintenance with scikit-learn shows how a failure forecast is carried out in a predictive maintenance scenario using the decision tree method of scikit-learn, the Python library for machine learning. We use the same automotive data set as in all demos of the predictive maintenance cycle: a csv file automotive_data.csv with 136 observations. Each observation (line) has a total of 24 columns, of which 22 columns contain sensor values and serve as the features of the data analysis. The target variable is the "Failure" column. The question is: "Which combination of feature values leads to a failure?"


In the other demonstrators of the predictive maintenance cycle (Demo 2: What is Predictive Maintenance?, Demo 2: Data analysis with RapidMiner, Demo 3: Predictive Maintenance with R, Demo 5: Predictive Maintenance with MATLAB), the data analysis and prediction for predictive maintenance are implemented with various data analysis tools: RapidMiner, R, MATLAB. In this demo of the cycle, we use the functions of the Python library scikit-learn, graphviz for the visualization of the decision tree, and Jupyter Notebook as the development and runtime environment.

Why scikit-learn?

scikit-learn is an open source machine learning library for the Python programming language. It offers powerful algorithms for classification, regression and clustering, see the sklearn functions. In this demo, scikit-learn plays a central role, as it provides the classification method we need: decision trees, see sklearn.tree.

Why graphviz?

Graphviz is an open source program package for the visualization of graphs. scikit-learn uses it, via the function export_graphviz(), to visualize decision trees. The generated visualization shows additional information for each node, such as the number of observations, the feature used for the split, the entropy, and the value of the target variable.


Demo-PY4 is divided into 5 sections. First, we describe the automobile data set and the question that we want to answer with our data analysis. Then we briefly describe how decision trees work and how they are used. Next, the creation of the Jupyter Notebook and the preparation of the required libraries are explained. In the following sections we describe the creation of a decision tree prediction model, its visualization with graphviz, and the determination of performance indicators that describe the quality of the model.

  1. The automobile data set

  2. What is a decision tree?

  3. Create Jupyter Notebook

  4. The decision tree predictive model

  5. Performance indicators

The automobile data set

The automobile data set is a csv file and consists of a total of 136 observations from 7 engines. The semicolon is used as a separator for the columns.

Each observation contains the target value Failure, with the possible values yes / no, the measurement number, and sensor values for a total of 22 features. Only 16 measurements are available for the 7th engine. The feature values originate from temperature and pressure measurements as well as quantitative information on fuel and exhaust gases, recorded at various points in the engine. The data set also contains eight lambda probe features. Each probe delivers measured values for one cylinder in the engine, and the cylinders are divided into 2 banks (for example, the feature lambda probe32 holds the sensor value from the third cylinder within the second bank). With the exception of the catalyst temperature (feature name: catalyst temperature category, values: normal / high), numerical measurements are available for all features.

What is a decision tree?

A decision tree is a simple and intuitive classification model that can be used to answer a question. In our case, the question is whether a certain combination of measurements leads to a failure or not. A decision tree consists of a root, child nodes and leaves, with each node representing a decision rule and each leaf representing an answer to the question. To read off the classification of an individual data object, you walk down the tree from the root node. At each node a feature is queried, and the result decides which node is visited next. This continues until you reach a leaf. The leaf corresponds to the classification.
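The walk from the root to a leaf can be sketched in a few lines of Python. This is only an illustration: the tree, the feature names "temperature" and "pressure", and the thresholds are invented and do not come from the automobile data set.

```python
# Hypothetical hand-built tree: inner nodes hold a decision rule, leaves hold an answer.
tree = {
    "feature": "temperature", "threshold": 90.0,
    "left": {"leaf": "no"},
    "right": {
        "feature": "pressure", "threshold": 1.2,
        "left": {"leaf": "no"},
        "right": {"leaf": "yes"},
    },
}

def classify(node, observation):
    """Walk down from the root until a leaf is reached and return its answer."""
    while "leaf" not in node:
        # Query the feature of the current node and pick the next node.
        go_left = observation[node["feature"]] <= node["threshold"]
        node = node["left"] if go_left else node["right"]
    return node["leaf"]

hot_engine = {"temperature": 95.0, "pressure": 1.5}
cool_engine = {"temperature": 80.0, "pressure": 1.5}
print(classify(tree, hot_engine), classify(tree, cool_engine))
```

scikit-learn stores its trees in arrays rather than nested dictionaries, but the traversal follows the same idea.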

The details of making a prediction using a decision tree are described in Demo2 - data analysis.

Create Jupyter Notebook

For data management, the creation of the decision tree prediction model and the prediction itself, a Jupyter Notebook is used. We open the Jupyter Notebook application via the programs menu and create a new Python3 notebook with the name elab2go-Demo-PY4 using the "New" menu item.

The details of using Jupyter Notebooks are described in the section Use Jupyter Notebooks. Further prerequisites, such as the basics of the Python programming language and an introduction to the Python library Pandas, are covered in more detail in Demo-PY1: Python tutorial and in Demo-PY2: data management with pandas.

The decision tree predictive model

Next, a decision tree prediction model is created in 6 code cells, with which we can test the correct sequence of steps: 1. Import the required program libraries, 2. Read data from the CSV file, 3. Extract features and target variable, 4. Split the data into training and test data, 5. Create a prediction model, 6. Visualize the prediction model.

1. Import the required program libraries

In the first code cell of the Jupyter notebook we import the required program libraries: numpy, pandas, the sklearn package metrics, and from sklearn.tree the class DecisionTreeClassifier and the function export_graphviz. numpy is required for storing the data in arrays, pandas for data preparation. The sklearn package metrics contains functions with which the quality of a prediction model can be assessed.

In Python, the import statement can be used to import either a complete program library or only individual functions of a library (from-import statement). During the import, alias names are assigned to the libraries: the alias np for numpy and the alias pd for pandas.
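A first code cell along these lines might look as follows. It is a sketch based on the libraries named above, not necessarily the exact cell of the demo.

```python
# Complete libraries, imported under the usual alias names
import numpy as np
import pandas as pd

# A single sub-package of sklearn (functions for model quality)
from sklearn import metrics

# Individual names via the from-import statement
from sklearn.tree import DecisionTreeClassifier, export_graphviz

print(np.__name__, pd.__name__, metrics.__name__)
```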

2. Read in data

In the second code cell, the data stored in the csv file automotive_data.csv are read using the function read_csv() from the Pandas library, see also Demo-PY2: data management with pandas.

  • Line 1: The training data set is in the same folder as our script, so the name of the file does not have to contain any path information.
  • Line 3: Generate a list of column numbers. With the help of the numpy function arange, the values [0, 1, 2, ..., 21] are generated: these are the columns of the csv file that are to be imported into a Pandas DataFrame and used as feature columns.
  • Line 4: Read the CSV file into a Pandas DataFrame. The read_csv() function receives the name of the file to be read as the first parameter. The other parameters are options that control the import.
    header = 0 means that the first line of the CSV file contains column headings.
    index_col = 0 means that the first column will be the pandas index column that contains the row headings.
    sep = ";" means that a semicolon is used as a separator.
  • Lines 5-7: Output part of the data for checking.
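The import options described above can be tried out with a small stand-in for the real file. In this sketch, the CSV content, its column names and values are invented for illustration, and an in-memory text buffer replaces the file automotive_data.csv so that the example runs on its own. The line numbers of this sketch do not correspond to those of the demo's code cell.

```python
import io

import numpy as np
import pandas as pd

# Hypothetical stand-in for automotive_data.csv (invented columns and values).
csv_text = (
    "Measurement;Temp1;Pressure1;Failure\n"
    "1;85.2;1.01;no\n"
    "2;97.8;1.35;yes\n"
    "3;88.1;1.05;no\n"
)
file = io.StringIO(csv_text)  # the demo passes the file name instead

# Column positions to import; the demo describes the positions 0..21.
cols = np.arange(0, 4)

df = pd.read_csv(
    file,
    header=0,             # first line contains the column headings
    index_col=0,          # first column becomes the row index
    sep=";",              # semicolon-separated columns
    usecols=cols.tolist(),
)

# Output part of the data for checking
print(df.shape)
print(df.head())
```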

3. Extract features and target variable

In the third code cell, the feature values are first extracted into a new DataFrame using iloc and then converted to a NumPy array x using the Pandas method to_numpy(). The "Failure" column is then extracted as the target variable into an array y.
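A minimal sketch of this extraction step, using a tiny invented DataFrame in place of the automobile data (the column names Temp1 and Pressure1 and the column positions are assumptions):

```python
import pandas as pd

# Hypothetical miniature data set: two feature columns and the target column.
df = pd.DataFrame({
    "Temp1": [85.2, 97.8, 88.1, 99.0],
    "Pressure1": [1.01, 1.35, 1.05, 1.40],
    "Failure": ["no", "yes", "no", "yes"],
})

# Select all rows and the feature columns by position, then convert to NumPy.
features = df.iloc[:, 0:2]
x = features.to_numpy()

# Extract the target variable into its own array.
y = df["Failure"].to_numpy()

print(x.shape, y.shape)
```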


4. Split the data into training and test data

In the fourth code cell, the data stored in the arrays x and y are split into a training data set and a test data set. The training data set is used to build the predictive model and the test data set is used to validate the model. The function train_test_split() receives the arrays x and y as input parameters and returns four arrays:
X_train: training data (features only), X_test: test data (features only)
y_train: training data (target variable), y_test: test data (target variable)
The parameter test_size controls the size of the test data set, here: 30%.
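The split can be sketched as follows. Since the original file is not included here, the arrays x and y are filled with random stand-in data of the same size (136 observations, 22 features); the random_state value is an assumption added to make the split reproducible.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data, NOT the automobile data set.
rng = np.random.default_rng(0)
x = rng.normal(size=(136, 22))
y = np.where(x[:, 0] + x[:, 1] > 0, "yes", "no")

# 70% of the observations for training, 30% for testing.
X_train, X_test, y_train, y_test = train_test_split(
    x, y, test_size=0.3, random_state=0
)

print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)
```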


5. Create a predictive model

The prediction model for the training data (X_train, y_train) is created using the method fit() of the scikit-learn class DecisionTreeClassifier. The DecisionTreeClassifier uses an optimized version of the CART algorithm. When the model is created as an instance of the DecisionTreeClassifier class, various configuration parameters are set that control exactly how the decision tree is built.

  • criterion: defines which function is used to measure the quality of a split at a node. Either "gini" or "entropy" can be used as a criterion. We choose the "entropy" criterion.
  • splitter: describes the strategy used to split the next node.
    -- splitter = 'random' means that the best of several randomly drawn splits is chosen to split the node.
    -- splitter = 'best' means that the best split, i.e. the most relevant feature, is selected to split the node.
  • min_samples_split: determines the minimum number of observations required to split a node. Can be specified as an absolute number (an integer of at least 2) or as a fraction of the observations (a float greater than 0.0 and at most 1.0). We set min_samples_split = 0.3, i.e. a node is only split if it contains at least 30% of the data records.
  • max_features: indicates the number of features that will be considered when looking for the best split. The smallest value for max_features is 1, the largest value is the number of features (in our case, 22).
    Informal interpretation: Assume that the training data set consists of 100 observations / lines and we set max_features = 10. Then 10 features are randomly selected for each split (i.e. each division of the data at a node), and from these 10 the best feature for the split is determined. Too large a value for max_features can lead to overfitting of the model to the training data. We set max_features = 10.
  • max_depth: determines the maximum depth of the decision tree.
  • max_leaf_nodes: determines the maximum number of leaf nodes.
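With the parameter values named in the list above, creating and training the model could look like the following sketch. The data are again random stand-in values rather than the automobile data, and random_state is an assumption added for reproducibility; max_depth and max_leaf_nodes are left at their defaults here.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data, NOT the automobile data set.
rng = np.random.default_rng(0)
x = rng.normal(size=(136, 22))
y = np.where(x[:, 0] + x[:, 1] > 0, "yes", "no")
X_train, X_test, y_train, y_test = train_test_split(
    x, y, test_size=0.3, random_state=0
)

# Configure the tree, then build it from the training data.
model = DecisionTreeClassifier(
    criterion="entropy",     # quality measure for a split
    splitter="best",         # choose the best split at each node
    min_samples_split=0.3,   # a node needs at least 30% of the training rows
    max_features=10,         # features examined per split
    random_state=0,          # assumption: fixed seed for repeatable results
)
model.fit(X_train, y_train)

print(model.get_depth(), model.get_n_leaves())
```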

6. Visualize the predictive model

The decision tree is visualized with the help of the library Graphviz and the scikit-learn function export_graphviz(). Graphviz is an open source program package for visualizing graphs. Graphviz takes all the instructions required to generate the graphic from a text file that contains a description of the nodes and edges of the graph in DOT, a description language for the visual representation of graphs.

First, in lines 2-5, the function export_graphviz() exports the decision tree to the Graphviz DOT format. A graph is then generated from this in line 7 and displayed in line 8 in the Jupyter Notebook output. Important: In order to use Graphviz, it is not enough to install the corresponding Python package; Graphviz must also be installed as a separate application on your computer and be reachable via the PATH variable.
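The export step alone can be sketched without a Graphviz installation, because export_graphviz() returns the DOT description as a string when out_file=None is passed. The data, the feature names f0..f21 and the display options are assumptions for illustration; the line numbers of this sketch do not match those of the demo's code cell.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_graphviz

# Synthetic stand-in data, NOT the automobile data set.
rng = np.random.default_rng(0)
x = rng.normal(size=(136, 22))
y = np.where(x[:, 0] + x[:, 1] > 0, "yes", "no")

model = DecisionTreeClassifier(criterion="entropy", random_state=0)
model.fit(x, y)

# Export the trained tree to DOT; with out_file=None the text is returned.
dot_data = export_graphviz(
    model,
    out_file=None,
    feature_names=[f"f{i}" for i in range(22)],  # hypothetical feature names
    class_names=["no", "yes"],                   # in ascending order of the labels
    filled=True,
    rounded=True,
)

print(dot_data[:80])
```

In a notebook with the graphviz Python package and the Graphviz application installed, the DOT string could then be rendered, for example with graphviz.Source(dot_data).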

If an error message occurs when visualizing with Graphviz, the decision tree can alternatively be visualized with the sklearn function plot_tree(). The function receives the previously trained decision tree (here: "model") as the first parameter and, as a second, named parameter, an option that specifies whether the tree nodes are displayed in color (filled = True) or in black and white (filled = False).
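A sketch of this fallback, assuming matplotlib is installed. The data are random stand-in values, and the line selecting the "Agg" backend is only there so that the example also runs outside a notebook; in Jupyter it can be omitted.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless run; not needed inside Jupyter
import matplotlib.pyplot as plt
import numpy as np
from sklearn.tree import DecisionTreeClassifier, plot_tree

# Synthetic stand-in data, NOT the automobile data set.
rng = np.random.default_rng(0)
x = rng.normal(size=(136, 22))
y = np.where(x[:, 0] + x[:, 1] > 0, "yes", "no")

model = DecisionTreeClassifier(
    criterion="entropy", min_samples_split=0.3, random_state=0
)
model.fit(x, y)

# Draw the tree with colored nodes; plot_tree returns the drawn annotations.
fig, ax = plt.subplots(figsize=(10, 6))
annotations = plot_tree(model, filled=True, ax=ax)
print(len(annotations))
```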

Performance indicators

To validate a decision tree predictive model, one uses measures that indicate how well the model predicts the value of the target variable. The most important key figures are described in detail in Demo2 - Data Analysis - Key Figures. We only use Accuracy, Precision and Recall here.

  • Accuracy:
    The probability that a correct prediction is made for an observation. Accuracy = quotient of the number of correct classifications and the number of all observations.
  • Precision:
    The probability that a prediction of failure is correct. Precision = quotient of the number of correct positive classifications and the number of predictions with failure = "yes".
  • Recall (hit rate):
    The probability that a failure is actually predicted for an observation that failed. Recall = quotient of the number of correct positive classifications and the total number of observations with failure = "yes".
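The three quotients can be written out directly from the four possible outcomes of a prediction. The counts in this sketch are invented and are not results of the demo.

```python
# Hypothetical counts for a test set of 20 observations:
tp = 8  # failure predicted, failure occurred (correct positive)
fp = 2  # failure predicted, no failure occurred (false alarm)
fn = 4  # no failure predicted, failure occurred (missed failure)
tn = 6  # no failure predicted, no failure occurred (correct negative)

accuracy = (tp + tn) / (tp + fp + fn + tn)  # correct / all observations
precision = tp / (tp + fp)                  # correct positives / predicted "yes"
recall = tp / (tp + fn)                     # correct positives / actual "yes"

print(f"accuracy={accuracy:.2f} precision={precision:.2f} recall={recall:.2f}")
```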

In a predictive maintenance scenario, the company incurs costs if a failure is not recognized as such, but there are also costs if a failure is incorrectly predicted and production is interrupted. That means the key figures Precision and Recall are both important.

In the following we determine the performance indicators Accuracy, Precision and Recall and thus evaluate the quality of the prediction model. First, a prediction is made based on the decision tree and the test data. The predict() method receives the test data as a parameter (the 30% of the data that were not used to create the model) and returns a vector y_pred with predictions: "yes" means failure, "no" means no failure.
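The prediction step might be sketched like this. The data are random stand-in values with the labels "yes" / "no", and random_state is an assumption added for reproducibility.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data, NOT the automobile data set.
rng = np.random.default_rng(0)
x = rng.normal(size=(136, 22))
y = np.where(x[:, 0] + x[:, 1] > 0, "yes", "no")
X_train, X_test, y_train, y_test = train_test_split(
    x, y, test_size=0.3, random_state=0
)

model = DecisionTreeClassifier(criterion="entropy", random_state=0)
model.fit(X_train, y_train)

# One prediction ("yes" or "no") per row of the test data.
y_pred = model.predict(X_test)
print(y_pred[:10])
```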


After the prediction array y_pred has been determined, the metrics described above can be calculated with the help of the functions accuracy_score, precision_score, and recall_score. With the chosen configuration parameters (criterion = 'entropy', splitter = 'best', min_samples_split = 0.3, max_features = 10) we achieve an accuracy of 75%, a precision of 83% and a recall of 68%.
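A sketch of the metric calculation follows. Because it runs on random stand-in data, its numbers will differ from the values reported above. Since the labels are the strings "yes" / "no", the positive class has to be named via pos_label; random_state is an assumption.

```python
import numpy as np
from sklearn.metrics import accuracy_score, precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data, NOT the automobile data set.
rng = np.random.default_rng(0)
x = rng.normal(size=(136, 22))
y = np.where(x[:, 0] + x[:, 1] > 0, "yes", "no")
X_train, X_test, y_train, y_test = train_test_split(
    x, y, test_size=0.3, random_state=0
)

model = DecisionTreeClassifier(
    criterion="entropy", splitter="best",
    min_samples_split=0.3, max_features=10, random_state=0,
)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

# Compare the predictions with the true values of the test data.
acc = accuracy_score(y_test, y_pred)
prec = precision_score(y_test, y_pred, pos_label="yes")
rec = recall_score(y_test, y_pred, pos_label="yes")

print(f"Accuracy:  {acc:.2f}")
print(f"Precision: {prec:.2f}")
print(f"Recall:    {rec:.2f}")
```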


A precision of 83% can be assessed as "good" in the case of our mini automobile data set with only 136 observations. It means that fewer than 20% of the predicted failures are false alarms, i.e. cases in which a failure is predicted when there actually is none.

Interactive visualization

Now that the basic functions of classification with decision trees have been tested, we next create an interactive visualization of the decision tree using the Jupyter widget function interactive(). The interactive user interface has control elements in the upper area that can be used to set the most important configuration parameters. Whenever a configuration parameter is changed, a new prediction model is created and visualized immediately. At the same time, the performance indicators for the validation data set are automatically calculated and displayed.
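The core of such an interface is a function that rebuilds and evaluates the model for a given set of parameters. The following sketch shows one possible shape of that function; its name train_and_score, the stand-in data and the widget value ranges in the comment are assumptions, and the visualization of the tree is left out.

```python
import numpy as np
from sklearn.metrics import accuracy_score, precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data, NOT the automobile data set.
rng = np.random.default_rng(0)
x = rng.normal(size=(136, 22))
y = np.where(x[:, 0] + x[:, 1] > 0, "yes", "no")
X_train, X_test, y_train, y_test = train_test_split(
    x, y, test_size=0.3, random_state=0
)

def train_and_score(criterion="entropy", min_samples_split=0.3, max_features=10):
    """Hypothetical callback: retrain the tree and return its key figures."""
    model = DecisionTreeClassifier(
        criterion=criterion,
        min_samples_split=min_samples_split,
        max_features=max_features,
        random_state=0,
    )
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    return {
        "accuracy": accuracy_score(y_test, y_pred),
        "precision": precision_score(y_test, y_pred, pos_label="yes", zero_division=0),
        "recall": recall_score(y_test, y_pred, pos_label="yes"),
    }

scores = train_and_score()
print(scores)

# In a notebook with ipywidgets installed, the function could be wired to controls:
#   from ipywidgets import interactive
#   ui = interactive(train_and_score, criterion=["entropy", "gini"],
#                    min_samples_split=(0.1, 1.0, 0.1), max_features=(1, 22))
#   display(ui)
```

Each change of a control then calls the function again with the new parameter values.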

Authors, tools and sources