Data sources can include databases of online stores, sensor readings, activity logs from social media platforms, or synthetically generated data. It is always worth knowing the origin and collection method of data when analysing it. When collecting data, the affected parties must be informed that their data is being collected; otherwise privacy or intellectual property rights may be violated.
With an inadequate dataset, a model may appear highly accurate during training yet perform poorly in real-world tests. For example, if a model must recognise horses but every sample in the dataset shows a horse photographed in a field, in front of a blue sky, at eye level, the model will struggle with a photo taken in a foggy forest. Such a model has overfitted to one type of data and does not generalise well.
In addition to being diverse (for photographs, containing samples with varied cropping, angles and lighting conditions), a good dataset should contain roughly the same number of samples from each class. Such a dataset is balanced, which also contributes to accuracy.
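Class balance is easy to check before training by counting the labels. The following sketch uses hypothetical class names and sample counts for illustration:

```python
from collections import Counter

# Hypothetical label list for a three-class image dataset.
labels = ["horse"] * 120 + ["cow"] * 115 + ["sheep"] * 30

counts = Counter(labels)
total = sum(counts.values())
for cls, n in sorted(counts.items()):
    print(f"{cls:>6}: {n:4d} samples ({n / total:.1%})")
```

A class whose share is far below the others (here "sheep") signals an imbalanced dataset that may need more samples or resampling.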
Before training, the data should be divided into three sets. The training set is the largest and is used by the algorithm to repeatedly recalculate weights and biases during training. The smaller validation set is used to measure accuracy on a separate group of data after each iteration. The third set is reserved exclusively for testing and is never seen by the algorithm during training. It is usually sufficient to allocate a few samples per class for testing, and about 20% of the remaining data is typically set aside for validation.
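The split described above can be sketched in a few lines. This is a minimal illustration, assuming string labels and an in-memory list of samples; the function name and parameters are hypothetical:

```python
import random
from collections import defaultdict

def split_dataset(samples, labels, test_per_class=5, val_fraction=0.2, seed=42):
    """Split data into train/validation/test sets: a few samples per class
    are reserved for testing, then ~20% of the remainder for validation."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for sample, label in zip(samples, labels):
        by_class[label].append((sample, label))

    train, val, test = [], [], []
    for items in by_class.values():
        rng.shuffle(items)                      # avoid ordering bias
        test.extend(items[:test_per_class])     # held out, never trained on
        rest = items[test_per_class:]
        n_val = int(len(rest) * val_fraction)   # ~20% of the remainder
        val.extend(rest[:n_val])
        train.extend(rest[n_val:])
    return train, val, test
```

Splitting per class (rather than over the whole dataset at once) keeps each split balanced, matching the point made earlier about class balance.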
The model's performance on the test set can be conveniently displayed in a confusion matrix, where every class appears on both the horizontal and vertical axes. The rows are the true classes, the columns the classes predicted by the model, and each element of the matrix counts how many samples of a given true class received a given prediction.
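Building such a matrix requires only counting (true class, predicted class) pairs. A minimal sketch, using hypothetical class names and predictions:

```python
def confusion_matrix(true_labels, predicted_labels, classes):
    """Rows are the true classes, columns the predicted classes;
    each cell counts how often that combination occurred."""
    index = {c: i for i, c in enumerate(classes)}
    matrix = [[0] * len(classes) for _ in classes]
    for t, p in zip(true_labels, predicted_labels):
        matrix[index[t]][index[p]] += 1
    return matrix

classes = ["horse", "cow", "sheep"]
y_true = ["horse", "horse", "cow", "sheep", "cow"]
y_pred = ["horse", "cow", "cow", "sheep", "cow"]
for cls, row in zip(classes, confusion_matrix(y_true, y_pred, classes)):
    print(cls, row)
```

Correct predictions accumulate on the main diagonal, so a well-performing model produces a matrix with large diagonal entries and small values elsewhere.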