Do state-of-the-art deep learning techniques allow for individual delineation of each cropland, as suggested by this recent article? During an internship at CESBIO, we evaluated the performance of the Mask R-CNN architecture for this task. In this post, we summarize what we learned.
Mask R-CNN principles
To sum up, Mask R-CNN is an architecture made of three main parts. First, a convolutional network called the backbone produces features from an input image. From these features, a second part, the Region Proposal Network (RPN), proposes and refines a number of regions of interest (as rectangular bounding boxes) that are likely to contain a single cropland. Finally, the last part keeps the best proposals, refines them once again, and produces a segmentation mask for each of them.
Left: Proposals kept by RPN; Right: Final detections made by Mask R-CNN along with their segmentation masks.
The figure below illustrates this network. It is taken from this short external article, where you can learn more about Mask R-CNN. Another notable fact about this network is that it contains a lot of convolution layers, about a hundred, which makes it complex to manipulate: it is difficult to explain results and trends.
Mask R-CNN in practice
In order to train this network, we used the RPG database (Registre Parcellaire Graphique), distributed once a year by IGN. Since this database is not complete, we added a complement provided by the ODR (Observatoire du Développement Rural, an INRAE lab). We simplified our approach by defining only one class, discarding the crop type information provided in these databases.
As input data, we used level 2A Sentinel-2 images provided by Theia. More precisely, we used the 4 spectral bands at 10 m resolution (red, green, blue and near infrared). We selected 7 MGRS tiles over French territory. For each of them, we chose 4 different dates within the growing season. We also have super-resolution images (at 5 m rather than 10 m), produced by previous work at CESBIO (using a Cascading Residual Network). These images are sharper than the 10 m Sentinel-2 bands, so we expected more accurate predicted contours when training models on them.
Selected tiles (31UDR and 31UEP are our test tiles, others are for training and validation)
Mask R-CNN has already been tested in the literature for this cropland instance segmentation task, so we tried to reproduce that work. Although the paper uses a TensorFlow-based implementation, we first tried to reproduce its results with the implementation included in the Torchvision module of PyTorch. However, we never managed to make this implementation converge, and we noticed that there are actually many differences between the two implementations. In particular, the provided pretrained models were trained on ImageNet for PyTorch and on COCO (which has a greater density of objects) for TensorFlow. Moreover, pretraining involves only the backbone for PyTorch, but the whole network for TensorFlow. Finally, the order of some layers and the choice of the features used during the last step of the network are different. We tried to identify the main elements that allowed the TensorFlow network to converge, without success.
Using the TensorFlow implementation, we evaluated several training use cases, which fall into three groups, described in the table below. The first group contains the first three cases. Each of them uses only one date from our training tiles, in order to test the ability of the model to generalise to data it has never seen, such as another acquisition date. The first three channels are always the RGB bands, and either NIR or NDVI is added as a fourth channel. The next three cases use multi-temporal input data. Either we use all the dates for each training tile (which gives a patch dataset four times larger), or we extract NDVI from each date and stack the results (thus using a multi-temporal NDVI stack). Finally, the last three cases use super-resolution images.
| Case | Dates | Channels | Resolution |
|---|---|---|---|
| T09NIR | September | RGB - NIR | 10 m |
| T06NIR | June | RGB - NIR | 10 m |
| T09NDVI | September | RGB - NDVI | 10 m |
| TADNIR | All | RGB - NIR | 10 m |
| TADNDVI | All | RGB - NDVI | 10 m |
| TMTNDVI | All (stacked) | NDVI (dates 1 to 4) | 10 m |
| T09NIRSR | September | RGB - NIR | 5 m |
| T09NDVISR | September | RGB - NDVI | 5 m |
| T09NDVISRSA | September | RGB - NDVI | 5 m |
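To make the two kinds of input layout more concrete, here is a minimal NumPy sketch of a single-date RGB + NDVI patch and of a multi-temporal NDVI stack. It is only an illustration: the function names, the band order (red, green, blue, NIR), the patch size and the small epsilon guarding against division by zero are assumptions made for this example, not details of our actual pipeline.

```python
import numpy as np

def ndvi(red, nir, eps=1e-6):
    """Normalised difference vegetation index: (NIR - red) / (NIR + red)."""
    red = red.astype(np.float32)
    nir = nir.astype(np.float32)
    # eps is an assumption of this sketch, to avoid dividing by zero
    return (nir - red) / (nir + red + eps)

def rgb_plus_ndvi(date_image):
    """Single-date input: R, G, B followed by NDVI, shape (4, H, W)."""
    red, green, blue, nir = date_image  # assumed band order
    return np.stack([red, green, blue, ndvi(red, nir)]).astype(np.float32)

def multitemporal_ndvi(date_images):
    """Multi-temporal input: one NDVI channel per date, shape (n_dates, H, W)."""
    return np.stack([ndvi(img[0], img[3]) for img in date_images])

# Toy data standing in for 4 dates of a 4-band, 64 x 64 patch
rng = np.random.default_rng(0)
dates = [rng.uniform(0.01, 0.5, size=(4, 64, 64)).astype(np.float32)
         for _ in range(4)]

single = rgb_plus_ndvi(dates[0])   # T09NDVI-like input
stack = multitemporal_ndvi(dates)  # TMTNDVI-like input
print(single.shape, stack.shape)
```

Both layouts end up with four channels per patch, but in the stacked case each channel comes from a different date instead of a different spectral band.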
These cases gave results that are interesting qualitatively, but not quantitatively. During inference, a set of polygons is produced, each with a confidence score. We can set a first threshold on this score, below which predictions are discarded.
Then, to match them with our target polygons, we used a geometrical criterion, illustrated on the right. This criterion estimates whether a target (green rectangle) and a prediction (red ellipse) have a sufficient intersection (in yellow) to consider their match as valid. By computing two ratios (yellow over green and yellow over red), then taking the minimum, we ensure that we are as restrictive as possible. The value obtained is called RC, and we can also set a threshold on it, in order to be more or less restrictive on the quality of our predictions. Once matches between predictions and targets have been computed, we can compute some classic detection metrics: precision, recall and F1-score. As a reminder, precision is the proportion of predictions that are actually croplands, recall is the proportion of targets that are detected, and the F1-score, the harmonic mean of the two, is a compromise between them.
Among our use cases, T09NDVISR and TMTNDVI produced the best results, as we can see in the table below. Here, we used rather restrictive thresholds, i.e. 0.8 for both confidence and RC.
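The RC criterion and the metrics above can be sketched in a few lines of Python. This is a simplified illustration, not our evaluation code: shapes are represented as sets of pixel coordinates rather than polygons, the greedy one-to-one matching is an assumption of the sketch, and the names `rc_score`, `evaluate` and `rectangle` are hypothetical.

```python
def rc_score(target, prediction):
    """min(|T & P| / |T|, |T & P| / |P|) for two sets of pixels."""
    if not target or not prediction:
        return 0.0
    inter = len(target & prediction)
    return min(inter / len(target), inter / len(prediction))

def evaluate(targets, predictions, conf_threshold=0.8, rc_threshold=0.8):
    """predictions is a list of (pixel_set, confidence) pairs.

    Returns (precision, recall, f1).
    """
    kept = [shape for shape, conf in predictions if conf >= conf_threshold]
    matched = set()  # indices of targets that already have a prediction
    true_positives = 0
    for pred in kept:
        # Greedy choice of the best unmatched target (assumption of this sketch)
        best_rc, best_idx = 0.0, None
        for idx, target in enumerate(targets):
            if idx in matched:
                continue
            rc = rc_score(target, pred)
            if rc > best_rc:
                best_rc, best_idx = rc, idx
        if best_idx is not None and best_rc >= rc_threshold:
            matched.add(best_idx)
            true_positives += 1
    precision = true_positives / len(kept) if kept else 0.0
    recall = true_positives / len(targets) if targets else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1

def rectangle(r0, c0, r1, c1):
    """Pixels of an axis-aligned rectangle, upper bounds excluded."""
    return {(r, c) for r in range(r0, r1) for c in range(c0, c1)}

targets = [rectangle(0, 0, 10, 10), rectangle(0, 20, 10, 30),
           rectangle(20, 0, 30, 10)]
predictions = [
    (rectangle(0, 0, 10, 9), 0.95),    # slightly too small: RC = 0.9
    (rectangle(0, 20, 10, 40), 0.90),  # twice too large: RC = 0.5
    (rectangle(20, 0, 30, 10), 0.50),  # perfect shape, low confidence
]
p, r, f1 = evaluate(targets, predictions)
print(p, r, f1)
```

In this toy case, the second prediction fully covers its target but is rejected at an RC threshold of 0.8 because half of its own area lies outside the target. Taking the minimum of the two ratios is what penalises both too-small and too-large predictions.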
| Model (evaluated on) | Precision (%) | Recall (%) | F1-score (%) |
|---|---|---|---|
| T06NIR (on June) | 35.34 | 25.61 | 29.7 |
| T06NIR (on September) | 31.51 | 21.45 | 25.52 |
| T09NIR (on June) | 30.08 | 20.8 | 24.6 |
| T09NIR (on September) | 30.31 | 21.79 | 25.35 |
| T09NDVI (on June) | 31.03 | 21.43 | 25.35 |
| T09NDVI (on September) | 29.79 | 21.21 | 24.78 |
To compare our performance with that of the authors of the paper mentioned above, we chose an RC threshold of 0.5 and a confidence threshold of 0.7. The results are the following:
| Model (evaluated on) | Precision (%) | Recall (%) | F1-score (%) |
|---|---|---|---|
| OSO (on 31UDR) | 45.89 | 22.54 | 30.23 |
Our multi-temporal NDVI stack seems to provide relevant information, and outperforms the authors' results (computed on their own test tiles). Super-resolution gives slightly lower detection metrics, roughly at the level of the authors' results (with a better recall but a lower precision). However, the good predictions seem to be of better quality: the contours produced (examples at two different scales are shown in the images below) are closer to the reference croplands than for the other models. Therefore, training with super-resolution images seems to produce more accurate contours.
Target croplands (green) and prediction made by T09NDVISR model (blue)
Finally, we also ran our matching process on the vectorised OSO product. We used tile 31UDR for this test, and the recall obtained is very low: since the OSO product merges neighbouring croplands of the same type, a large number of targets remain undetected. This justifies the interest of an instance segmentation approach. Despite this, the results obtained are, at this stage, far from being an end-user product. It should be noted, however, that because we use RC as a matching criterion, we are unable to measure fragmentation (i.e. detecting a target with several predictions) or agglomeration (i.e. detecting several targets with a single prediction). Yet, on the images, some fragmentation or agglomeration seems legitimate, so we clearly underestimate our performance.
Article written as part of my CESBIO internship funded by CNES. Many thanks to Julien, Jordi and Olivier for their help.