Update data part (b3cdba90) · Commits · igraf / exp-ml-2-hillengass-graf

project/README.md

+29 −10

		@@ -12,34 +12,53 @@ The motivation for this project is to explore the capabilities of machine learni

		## Related Work


		:construction:

		## Data

		### Train / Dev / Test Split

		Our project involves a dataset initially lacking a predefined split into training, development (validation), and testing sets. To tailor our dataset for effective model training and evaluation, we implemented a custom script that methodically divides the dataset into specific proportions.
		In our project, we have focused on 30 specific classes out of the 262 available fruit classes. This decision was based on the relevance and diversity we aimed to achieve in our model's learning scope. We selected classes that are typically found in supermarkets in Germany (e.g. apples, bananas, oranges, mandarins, strawberries) but also included some exotic fruits (e.g. passion fruit, Buddha's hand). We deliberately included fruits that are similar in appearance (e.g. mandarins and oranges) to challenge the model's ability to distinguish between them. A comprehensive list of the selected classes can be found [here](data/class_counts.md).

		The original dataset lacks a predefined split into training, development (validation), and testing sets. To prepare it for model training and evaluation, we implemented a custom script that divides the images of each selected class into fixed proportions.

		<figure>
		<img style="float: right; margin-left: 10px; margin-bottom: 10px" src="figures/dataset_split.png" alt="Dataset Split" width="500" height="auto">
		</figure>


		The script is configured to split the dataset into
		- 70% for training
		- 15% for development (validation), and
		- 15% for testing.

		The script is configured to split the dataset into 70% for training, 15% for development (validation), and 15% for testing. This means, for each selected fruit class, typically consisting of around 1000 images, we allocate 700 images for training, 150 images for development, and 150 images for testing. This allocation ensures a significant amount of data is used for the model's learning while adequately reserving a representative number of images for validation and testing.

		In our project, we have focused on 30 specific classes out of the 262 available fruit classes. This decision was based on the relevance and diversity we aimed to achieve in our model's learning scope. A comprehensive list of these selected classes can be found [here](data/class_counts.md).
		This means that for each selected fruit class, which typically contains around 1000 images, we allocate roughly
		- 700 images for training
		- 150 images for development, and
		- 150 images for testing.

		This allocation devotes most of the data to training while reserving enough images per class for validation and testing to be representative.


		We opted not to use cross-validation for the train-dev split. With approximately 1000 images in each of the 30 selected classes, a single held-out development set of about 150 images per class should be large enough for validation. Skipping cross-validation also keeps training streamlined, since each model is trained only once.

		The data partitioning script randomly segregates the images for each of the chosen fruit classes into the designated training, development, and testing sets. This random allocation is pivotal for maintaining the data integrity and representativeness in each subset, facilitating an unbiased evaluation of the model's performance.
		The data partitioning script randomly assigns the images of each fruit class to the training, development, and testing sets. Splitting at random within each class keeps every class represented in each subset in the same proportions, which supports an unbiased evaluation of the model's performance.
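
The per-class random split described above could be sketched as follows. This is a minimal illustration, not the project's actual script: the function names `split_class` and `split_dataset`, the fixed seed, and the placeholder file names are all assumptions.

```python
import random


def split_class(items, percents=(70, 15, 15), seed=42):
    """Shuffle one class's items and cut them into train/dev/test lists."""
    if sum(percents) != 100:
        raise ValueError("percents must sum to 100")
    # Sort first so the result depends only on the seed, not on input order.
    shuffled = sorted(items)
    random.Random(seed).shuffle(shuffled)
    n_train = len(shuffled) * percents[0] // 100
    n_dev = len(shuffled) * percents[1] // 100
    return {
        "train": shuffled[:n_train],
        "dev": shuffled[n_train:n_train + n_dev],
        "test": shuffled[n_train + n_dev:],  # the remainder goes to test
    }


def split_dataset(files_by_class, seed=42):
    """Apply the same split independently to every class."""
    return {
        cls: split_class(files, seed=seed)
        for cls, files in files_by_class.items()
    }


# Toy input with placeholder file names (hypothetical, not the real dataset).
files_by_class = {
    "apple": [f"apple_{i}.jpg" for i in range(1000)],
    "orange": [f"orange_{i}.jpg" for i in range(1000)],
}
for cls, parts in split_dataset(files_by_class).items():
    print(cls, {name: len(part) for name, part in parts.items()})
```

Using integer percentages avoids floating-point rounding when computing the subset sizes, and any images left over after the integer division fall into the test set.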


		### Data Statistics

		As part of our dataset analysis, we've focused on understanding the distribution of images across the 30 selected fruit classes in our dataset. Now this dataset is organized into train, dev, and test subsets, and our analysis aims to provide a clear overview of the image distribution across these classes and subsets.
		As part of our dataset analysis, we examined how images are distributed across the 30 selected fruit classes. The dataset is now organized into train, dev, and test subsets, and the analysis below gives an overview of the image distribution across these classes and subsets.

		Our dataset contains a total of 29430 images across the training, development, and testing subsets, which is an average of 981 images for each of the 30 selected fruit classes.
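
A count like this could be reproduced with a short script. The sketch below assumes a directory layout of `<root>/<subset>/<class>/<image>` and a fixed set of image suffixes. Both are assumptions for illustration, as is the function name `count_images`.

```python
from collections import Counter
from pathlib import Path

IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png"}  # assumed image file types


def count_images(root):
    """Return {(subset, class_name): image_count} for root/<subset>/<class>/<image>."""
    counts = Counter()
    for path in Path(root).glob("*/*/*"):
        if path.is_file() and path.suffix.lower() in IMAGE_SUFFIXES:
            subset, class_name = path.parts[-3], path.parts[-2]
            counts[(subset, class_name)] += 1
    return counts
```

Given such a layout, `sum(count_images(root).values())` yields the total image count, and summing the entries that share a class name gives the per-class totals.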


		Total Image Count: We counted a total of 29430 images across the training, development, and testing subsets. This total count is indicative of the dataset's comprehensive nature.
		To visually represent the class-wise distribution, we plotted a histogram:

		Class-wise Distribution: To visually represent this distribution, we plotted a histogram.
		<img src="figures/class_distribution_histogram.png" alt="Histogram" width="800" height="auto">

		![Histogram](figures/class_distribution_histogram.png)

		Balance of Dataset: The histogram provides visual and easy to see insights into whether the dataset is balanced or unbalanced. In our case the dataset is mostly balanced. The nectarine, orange and jostaberry classes may have insufficient datapoints. We will maybe exchange these classes with different fruits.
		Balance of Dataset: The histogram makes it easy to see whether the dataset is balanced. In our case, the dataset is mostly balanced, but the nectarine, orange, and jostaberry classes may have too few data points. We will keep an eye on the performance of our models on these classes to see whether the imbalance affects the model's ability to learn and predict them :mag:.
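
Besides reading the histogram, under-represented classes could be flagged programmatically. The sketch below marks every class whose image count falls below a fraction of the median class size. The 80% threshold, the function name `underrepresented`, and the example counts are made up for illustration and are not taken from the project's data.

```python
from statistics import median


def underrepresented(class_counts, min_fraction=0.8):
    """Return class names whose count is below min_fraction of the median count."""
    threshold = min_fraction * median(class_counts.values())
    return sorted(name for name, n in class_counts.items() if n < threshold)


# Made-up counts for illustration only.
example_counts = {"apple": 1000, "banana": 990, "mango": 1010, "orange": 600}
print(underrepresented(example_counts))  # ['orange']
```

The median is used instead of the mean so that a few small classes do not pull the threshold down.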


		## Metrics