-[Table of Contents](#table-of-contents)
-[About this folder 🤓](#about-this-folder-)
-[Structure](#structure)
-[Goals 🏆](#goals-)
-[Decision trees in max depth 🌳](#decision-trees-in-max-depth-)
-[What does *"Bezeichne"* mean?](#what-does-bezeichne-mean)
-[Work with the classifiers](#work-with-the-classifiers)
-[What predictions do DecisionTree and RandomForest make?](#what-predictions-do-decisiontree-and-randomforest-make)
-[Calculate accuracy?](#calculate-accuracy)
-[How does the max_depth parameter change the accuracy of our DecisionTreeClassifier?](#how-does-the-max_depth-parameter-change-the-accuracy-of-our-decisiontreeclassifier)
-[What influence does the max_depth parameter have on the accuracy in general?](#what-influence-does-the-max_depth-parameter-have-on-the-accuracy-in-general)
-[What are the visual differences between the plots? 👓](#what-are-the-visual-differences-between-the-plots-)
-[Why are there different accuracies for the plots?](#why-are-there-different-accuracies-for-the-plots)
-[Exercise 7](#exercise-7)
-[Differences between RandomForest and DecisionTreeClassifier 🌳❓](#differences-between-randomforest-and-decisiontreeclassifier-)
-[Exercise 9 🙋♀️](#exercise-9-️)
-[Thoughts on a possible project topic 🙋♀️](#thoughts-on-a-possible-project-topic-️)
-[What do we want to do for our project?](#what-do-we-want-to-do-for-our-project)
-[Should we use a decision tree for our project?](#should-we-use-a-decision-tree-for-our-project)
For the calculation of the accuracy of our classifier on X_test we implemented a function ```cal_acc(true, pred)```. <br>
With every run of the code the predictions changed slightly, and with them the accuracy. The accuracy stayed quite high for all three datasets, but was always highest for the linearly separable dataset. The accuracy when testing on the training set is 100 percent in all cases.
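A minimal sketch of what such an accuracy helper could look like (the actual implementation in our notebook may differ):

```python
def cal_acc(true, pred):
    """Return the fraction of predictions that match the gold labels."""
    correct = sum(t == p for t, p in zip(true, pred))
    return correct / len(true)

print(cal_acc([0, 1, 1, 0], [0, 1, 0, 0]))  # 3 of 4 labels match -> 0.75
```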
#### How does the max_depth parameter change the accuracy of our DecisionTreeClassifier?
The accuracy on the training set is expected to be higher than the one on the test set, as the training set is the source of information the classifier learned from. While keeping the default max_depth and all the other default parameters of the DecisionTreeClassifier, the accuracy on the training set is 100%. However, when max_depth is changed you get different accuracies for each dataset.<br>
Observations:
- max_depth too high: overfitting on the training data and a low testing score
- values above 9 don't change the accuracy for the test and training set
- below 5 the accuracy for both sets is reduced to about half of the default value
Our goal is to maximize the test accuracy, not the train accuracy, but to achieve that we need a high training accuracy.<br>
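These observations can be reproduced with a small loop; here is a sketch using ```make_moons``` as a stand-in for our datasets:

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# stand-in dataset; the notebook uses the three datasets from the exercise
X, y = make_moons(n_samples=200, noise=0.3, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for depth in [1, 3, 5, 9, None]:  # None = grow the tree until all leaves are pure
    clf = DecisionTreeClassifier(max_depth=depth, random_state=0)
    clf.fit(X_train, y_train)
    print(depth, clf.score(X_train, y_train), clf.score(X_test, y_test))
```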
#### What influence does the max_depth parameter have on the accuracy in general?
If you set max_depth too high, then the decision tree might overfit the training data without capturing useful patterns. This will cause a lower testing accuracy. But if you set it too low, then you might be giving the decision tree too little flexibility to capture the patterns and interactions in the training data. This will also lead to a lower testing accuracy. <br>
A sweet spot between these two extremes needs to be found: not too high and not too low. <br>
To find the optimal depth we wrote our own function and a function using a grid search. In the end we used scikit-learn's ```GridSearchCV``` to find __optimal values for max_depth__. The function could easily be used to optimize the other parameters of the classifiers.
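In outline, the search could look like this (a sketch with ```make_moons``` as a placeholder dataset and a guessed parameter range):

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

X, y = make_moons(n_samples=200, noise=0.3, random_state=0)

# try every max_depth from 1 to 10 with 5-fold cross-validation
param_grid = {"max_depth": list(range(1, 11))}
search = GridSearchCV(DecisionTreeClassifier(random_state=0), param_grid, cv=5)
search.fit(X, y)
print(search.best_params_)
```

The same ```param_grid``` dict could simply be extended with further entries, e.g. ```min_samples_leaf```, to tune other parameters.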
#### What are the visual differences between the plots? 👓
After we finally succeeded in getting our code to run without any unexpected things happening, we could take a look at the plots of our datasets. In the plot of the first dataset, the interlocking moons (the shape of the original data) can still be vaguely seen. The blue part and the red part are interrupted in some areas by the other colour, and all in all there are only straight decision boundaries, so that each coloured area consists of multiple squares.<br>
The second dataset, with its data points aligned in circles, gets plotted as one big blue square in the middle (with some outliers) that is surrounded by a red area. In this plot you can see that again only straight lines are used as decision boundaries. <br>
The last dataset with the linearly separable points has decision boundaries that part the data points vertically with straight lines. The red area is disrupted by a vertical blue bar that includes the blue points in the red area but apart from that the plot parts the data into two separable areas. <br>
The plots with the optimized parameters show that the optimum max_depth leads to a plot with more white areas of the decision boundary. <br>
The same phenomenon can be seen in the RandomForest plot. The data is plotted in a lot more detail in both scenarios, with smaller squares that build up one coloured area and some lighter/white areas. For the optimized decision boundary there are again more white/lighter areas. This shows that there is not just the binary decision blue or red but also lighter gradations in the figure.
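Those lighter gradations come from the class probabilities; assuming the plots use ```predict_proba```, as in the usual scikit-learn boundary-plot recipe, the shaded surface can be computed like this (a sketch, plotting itself left out):

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.ensemble import RandomForestClassifier

X, y = make_moons(noise=0.3, random_state=0)
clf = RandomForestClassifier(random_state=0).fit(X, y)

# class probabilities on a grid; values near 0.5 are the light/white areas
xx, yy = np.meshgrid(np.linspace(-2, 3, 100), np.linspace(-1.5, 2, 100))
proba = clf.predict_proba(np.c_[xx.ravel(), yy.ravel()])[:, 1].reshape(xx.shape)
# plt.contourf(xx, yy, proba, cmap="RdBu_r", alpha=0.8) would draw the shaded plot
```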
#### Why are there different accuracies for the plots?
While the linearly separable plot catches all the outliers (because it was trained with outliers) and therefore doesn't have any test data points that are classified wrongly, the other plots have more difficulty achieving a high accuracy, as their decision boundaries are more detailed and depend on the training data. The outliers in the testing set are always very close to the decision boundary but are still classified wrongly.<br>
The calculated optimum for the max_depth parameter leads to a higher accuracy because with a deeper tree more edge cases can be included and the training data is plotted in more detail. This could lead to overfitting on the training data, but in our case a higher depth increases the accuracy.
### Exercise 7
#### Differences between RandomForest and DecisionTreeClassifier 🌳❓
The first big difference that can be spotted is that the plot for the RandomForest classifier doesn't use long straight lines as decision boundaries but more detailed ones. This way the plot no longer looks like squares put together: in the case of the moon-shaped dataset it looks like curved moons, and the circle dataset is plotted more like a circle. The linearly separable dataset is not separated by straight vertical lines but now has cutouts for data points with a different colour, so that no additional vertical lines are needed to classify these points correctly.<br>
For the classification of the linearly separable dataset we get the highest accuracy (1) with both classifiers. With the other datasets we always have overfitting while training on the training data, because there we reach an accuracy of 1 all the time, but for the testing the accuracy is always lower.
### Exercise 9 🙋♀️
- the plot for the RandomForest classifier doesn't use long straight lines as decision boundaries but more detailed ones
- RF: not just squares anymore, the decision boundary is shaped more like the data (e.g. like moons)
- the linearly separable dataset is not separated by straight vertical lines but now has cutouts for data points with a different colour -> no additional vertical lines needed to classify these points correctly
- for the classification of the linearly separable dataset: highest accuracy (1) with both classifiers
- for both classifiers with the other datasets: training accuracy of 1, but lower testing accuracy
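The train-vs-test gap in the last two bullets can be reproduced with a few lines (a sketch on a synthetic stand-in dataset, not our actual data):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=200, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# both classifiers memorize the training data; the test score stays lower
for Clf in (DecisionTreeClassifier, RandomForestClassifier):
    clf = Clf(random_state=0).fit(X_train, y_train)
    print(Clf.__name__, clf.score(X_train, y_train), clf.score(X_test, y_test))
```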
### Thoughts on a possible project topic 🙋♀️
#### What do we want to do for our project?
+ author's gender
+ emojis (for tweets)
+ How to measure the performance?
+ let several people rate song lyrics or tweets on their level of misogyny (binary classification or multi-categorical → Likert scale)
+ split data into test and training
+ train the algorithm on the training data
+ accuracy: matching of the manual rating and the categorization by the algorithm on the test data
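A tiny sketch of how the manual Likert ratings could be collapsed into gold labels and compared with classifier output (all texts, ratings, and the threshold are made up):

```python
# hypothetical annotator ratings on a 1-5 Likert scale, three raters per text
ratings = {
    "lyric_1": [1, 2, 1],
    "lyric_2": [5, 4, 4],
    "tweet_1": [3, 4, 2],
}

# collapse to a binary gold label: mean rating >= 3 counts as misogynistic
gold = {text: int(sum(r) / len(r) >= 3) for text, r in ratings.items()}

# compare with (made-up) classifier predictions on the held-out test set
preds = {"lyric_1": 0, "lyric_2": 1, "tweet_1": 0}
accuracy = sum(preds[t] == gold[t] for t in gold) / len(gold)
print(gold, accuracy)  # two of the three predictions match the gold labels
```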
#### Should we use a decision tree for our project?
Pros✅:
- explainability
- A decision tree is a white box. It is easy to understand how the decisions are made.
- preprocessing
- Compared to other algorithms, the preprocessing is a lot easier and less time consuming.
@@ -160,4 +181,5 @@ Cons❌:
- Rating the data will take much time. So, we might not be able to create a very large data set.
- On a small data set, overfitting would probably be high.
- may become really complex
- We will have maaany features. Thus, a decision tree could become really complex.