Digging for Data: How TCI Helped ICRISAT Expand Its India Database
In 2018, the Tata-Cornell Institute for Agriculture and Nutrition (TCI) and its partner organization, the International Crops Research Institute for the Semi-Arid Tropics (ICRISAT), began an initiative to create a comprehensive, one-stop shop for data related to India’s food system and beyond. In this blog post, TCI research support specialist Kiera Crowley recounts how she tracked down data and learned to code in order to help build the District-Level Database for Indian Agriculture and Allied Sectors.
When I first started working at TCI, I was presented with a challenge. We wanted to compare India’s earliest-available district-level agricultural data to the latest data available, but at that time no one had brought it all together in one place. An up-to-date, time-series database at the district level would help us to visualize trends in agriculture across both time and location, and to see how policies and varying levels of infrastructure development, such as irrigation or agricultural implements, have shaped those trends. This is especially important in India, where different states have experienced varying levels of development and economic growth. It would help us to see how those trends are related to nutritional outcomes and to identify where changes to agricultural policies could help to improve people’s livelihoods and nutrition.
One of TCI’s partners, ICRISAT, had already compiled a database going back to the 1960s, but it had not been updated since 2011. We were excited to work with newer data as soon as possible, so I began downloading the most recent data from various websites operated by the Indian government and matching it with the ICRISAT data to enable comparison.
The first challenge was just to collect the data. Data on government websites was not easily accessible and there was no option for bulk downloads. Rather than clicking through multiple drop-down menus to download data one state at a time, one dataset at a time, TCI postdoctoral associate Andaleeb Rahman suggested that I write a script to scrape the data from the websites. There was just one problem: I had no experience writing that kind of code. In my previous work as a master’s student, I had only written code to run relatively straightforward statistical analyses. This would be much more complicated. Fortunately, I got a crash course from a friend and soon wrote my first scraping script in the Python programming language.
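The scraping script itself is not reproduced in this post, but the general pattern is easy to sketch. The example below is only an illustration under stated assumptions: the base URL, the `state` and `dataset` query parameters, and the function names are all hypothetical placeholders, not the addresses or parameters of any actual government website. It simply loops over every state and dataset combination, which is the work the drop-down menus would otherwise require by hand.

```python
from itertools import product
from urllib.parse import urlencode
from urllib.request import urlopen

# Hypothetical placeholder; a real site would have its own address and parameters.
BASE_URL = "https://example.org/data"


def build_url(state, dataset, base_url=BASE_URL):
    """Build the download URL for one state and one dataset (assumed query format)."""
    query = urlencode({"state": state, "dataset": dataset})
    return f"{base_url}?{query}"


def fetch(url):
    """Download one page and return its raw bytes."""
    with urlopen(url, timeout=30) as response:
        return response.read()


def scrape_all(states, datasets, fetch_page=fetch):
    """Fetch every (state, dataset) combination instead of clicking through menus.

    fetch_page can be swapped for a stub, which keeps the loop testable offline.
    """
    results = {}
    for state, dataset in product(states, datasets):
        results[(state, dataset)] = fetch_page(build_url(state, dataset))
    return results
```

Passing the download function in as `fetch_page` is a design choice for the sketch: the same loop can be pointed at a different site, or at a stub for testing, without rewriting it.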
With that code, I had a template that I’ve been able to modify to scrape additional websites. I’ve downloaded data on area classification (land use), area irrigated by crop, area irrigated by source, crop area and production, agricultural wages, and farm harvest prices. Some of the data, like agricultural wages and farm harvest prices, was in PDF format and required additional coding to convert it to a usable format.
But another challenge still stood in the way: India’s state and district borders have changed since the 1960s. In order to compare the most recent data to the older data, the 571 modern districts in the 20 states for which we collected data had to be matched to the 313 districts that existed in those states in the 1960s. Fortunately, ICRISAT had already created a list of districts formed after 1966 and their parent districts. Using this list, I needed to write apportioning code that takes data from new districts and gives it back to the old ones. This is simple enough when new districts were formed from just one parent district (for example, if a district was split in three, the data from those three districts could be added up and given to the parent district). However, if new districts came from more than one parent district (for example, 40% of the new district is from parent district A, and 60% is from parent district B), this becomes more complicated.
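To make the two cases concrete, here is a minimal sketch of the apportioning idea. It is not the code used for the database: the district names, the values, and the `apportion` function are hypothetical, and it assumes the quantity being apportioned is additive (such as area or production) and that each new district's shares sum to 1.

```python
from collections import defaultdict


def apportion(new_values, parent_shares):
    """Give values reported for modern districts back to their parent districts.

    new_values:    {modern_district: value}
    parent_shares: {modern_district: {parent_district: share}}
    A district missing from parent_shares is assumed to be unchanged.
    """
    totals = defaultdict(float)
    for district, value in new_values.items():
        shares = parent_shares.get(district, {district: 1.0})
        if abs(sum(shares.values()) - 1.0) > 1e-9:
            raise ValueError(f"shares for {district} do not sum to 1")
        for parent, share in shares.items():
            # Each parent receives its share of the new district's value.
            totals[parent] += value * share
    return dict(totals)


# Hypothetical example: P was split into N1, N2 and N3,
# while N4 was formed 40% from A and 60% from B.
parent_shares = {
    "N1": {"P": 1.0},
    "N2": {"P": 1.0},
    "N3": {"P": 1.0},
    "N4": {"A": 0.4, "B": 0.6},
}
new_values = {"A": 300, "B": 400, "N1": 100, "N2": 50, "N3": 25, "N4": 200}
apportioned = apportion(new_values, parent_shares)
```

In this made-up example, parent district P gets back 100 + 50 + 25 = 175, while A and B each keep their own value plus their share of N4, giving 380 and 520. Quantities that are not additive, such as prices or wages, would likely need a weighted average rather than a simple sum.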
In the summer of 2018, I met with the ICRISAT staff in Hyderabad to discuss the apportioning of the database. With their input and a list of new districts with corresponding parent districts and apportioning percentages, I was able to finish writing the apportioning code. ICRISAT used the code to create apportioned versions of an array of datasets, all of which are now available through an easily accessible platform on the District-Level Database website.
TCI is currently using the database to create district-level maps of agriculture and nutrition trends. These maps help to tell the story of how regions that have been successful in the agricultural sector have experienced better nutrition outcomes than those that have lagged in agricultural development. It is our hope that this data will enable all researchers and policymakers interested in Indian food systems to make sense of agricultural trends and create policies that will improve livelihoods and nutrition in an environmentally sustainable manner.
Kiera Crowley is a Research Support Specialist at the Tata-Cornell Institute. Her research focuses on mapping trends in food, agriculture, and nutrition in India.