
New machine learning techniques hold tremendous potential to revolutionize our ability to solve complex social and environmental problems. However, there is increasing recognition that these tools can return incorrect and even biased results. The result is automated irrigation equipment that works well for certain crops or areas but overwaters in others, or self-driving trucks that can navigate asphalt roads but fail to reach their destination over dirt or gravel. Ultimately these tools are only as good as the data they learn from. To build algorithms that work for the whole world, we must use training data that better represents the world’s diversity. This post lays out a vision for creating geo-diverse data sets.

The Google Brain Team recently released a paper investigating geographic bias in open data sets for machine learning. Specifically, they found that ImageNet and Open Images “appear to exhibit an observable amerocentric and eurocentric representation bias”. So while these data sets have been invaluable as benchmarks for advancing the machine learning field, their use in production models should be called into question. We suggest that examining the geo-diversity of an open data set is critical before adopting it for use cases in the developing world.

No Classification without Representation: Assessing Geodiversity Issues in Open Data Sets for the Developing World

This tracks with our own experience applying machine learning to analyze satellite imagery. Roads, buildings, agriculture, settlements, and poverty look different from space depending on where you are in the world. Even within a country you can see big differences. We recently mapped the electricity grid in Pakistan using training data from one region of the country. When we applied that model to predict power lines in the rest of the country, it made more errors in farmland and other terrain that was underrepresented in our training data. Our colleagues at N/Lab in the University of Nottingham have experienced this phenomenon in their work and are currently working to better quantify the impact of geo-diversity in this space.

Black dots correspond to locations where the model predicts a high voltage tower was present. In the mountainous desert region of Pakistan, the model performed relatively well. In the Eastern agricultural region, the model predicted many more false positives, as we did not obtain any training data from this region.

In the Pakistan case, we were able to recognize representation issues due to obvious geographical differences in model performance (seen above). But in many cases the bias in our models might be subtler and harder to detect. Still, these biases can have real impacts. What we realized is that there aren’t good tools to evaluate, improve, and articulate the level of geo-diversity in our training data. We are therefore missing important tools for understanding and correcting bias.

Open data and data readiness take on a new importance in the context of machine learning. Engineers are able to develop accurate agricultural models for predicting corn output in the US because there are decades of good data about corn yields from the US Department of Agriculture. This data is paired with readily available, open satellite imagery from sources like Landsat and Sentinel to create a predictive algorithm. But while the imagery provides worldwide coverage, open data on agricultural yields doesn’t exist for many crops or in many countries. To power new machine learning tools, it is more critical than ever to work with governments to generate open data that creates knowledge and value from all that open imagery. The World Bank, United Nations, and other groups should work to provide open data sets that are standardized across countries.

There are a few great training data sources for satellite machine learning: SpaceNet, Functional Map of the World, and xView. We need to expand on these and create more high-quality training data sets that are representative of many parts of the world. These can be useful in training new algorithms and also for evaluating the generalizability of existing models. Building a separate validation set would be a valuable tool in ensuring that algorithms generalize to the global South. This data could be used as a scoring set to encourage developers to build and improve models so they excel beyond the Western world.

Training data from Functional Map of the World.
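
To make the idea of checking geo-diversity a little more concrete, here is a minimal sketch in Python. It is not the tooling we used in Pakistan. The region labels, function names, and sample counts are all hypothetical, chosen only to illustrate two simple checks: how much of the training data comes from each region, and how precise the model's predictions are in each region of a separate validation set.

```python
from collections import Counter, defaultdict


def region_shares(train_regions, all_regions):
    """Share of training samples per region.

    Regions of interest that have no training samples are reported
    with a share of 0.0 instead of being silently left out.
    """
    counts = Counter(train_regions)
    total = sum(counts.values())
    return {region: counts[region] / total for region in all_regions}


def precision_by_region(predictions):
    """Precision per region for a binary detector.

    `predictions` is an iterable of (region, predicted, actual) tuples,
    where `predicted` and `actual` are booleans.
    """
    true_pos = defaultdict(int)
    false_pos = defaultdict(int)
    for region, predicted, actual in predictions:
        if predicted and actual:
            true_pos[region] += 1
        elif predicted:
            false_pos[region] += 1
    return {
        region: true_pos[region] / (true_pos[region] + false_pos[region])
        for region in set(true_pos) | set(false_pos)
    }


# Hypothetical data: the labels and counts below are made up.
regions = ["mountain_desert", "river_plain", "eastern_farmland"]
train = ["mountain_desert"] * 9 + ["river_plain"] * 1

validation = (
    [("mountain_desert", True, True)] * 4
    + [("mountain_desert", True, False)] * 1
    + [("eastern_farmland", True, True)] * 1
    + [("eastern_farmland", True, False)] * 3
)

shares = region_shares(train, regions)
precision = precision_by_region(validation)

for region in regions:
    print(region, shares[region], precision.get(region))
```

Reading the two tables side by side is the point: a region with a near-zero share of the training data and low precision on the validation set is a candidate for collecting more labeled examples. A real check would need a principled way to define regions and far larger samples than this toy example.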
