
Cornell University

Tata-Cornell Institute for Agriculture and Nutrition

The game of the name: matching names across datasets

Asha Sharma is a Post-Doctoral Associate with the TCi program, where she is working to quantify risks due to climate change on agriculture in India. Her research interests include the intersection of water resources, climate change, and food systems, and the estimation of trends in water resources in data-scarce regions. Asha earned Masters and PhD degrees in Biological and Environmental Engineering from Cornell University.

The gravity of the questions we pose in research is offset by the sheer mundanity of some of the day-to-day tasks involved in answering those questions. Over the past few decades, India has done quite a bit of reorganization of its internal political boundaries. It now has a handful more states than the ones I learned about in elementary school. Even more important for my research, in the four decades between 1970 and 2011, the number of districts went up by more than 60%, to more than 600 (see Figure 1). In the four years since, another forty or so have come into existence!

Discussions of the merits and demerits of this seemingly endless boundary-fiddling aside, this poses a challenge when we try to use data from several decades, as climate change studies require us to do. We need to be sure that the data we compare are consistent across time, and this means having a way to match which existing district was carved out from which old district, or in some cases, from portions of multiple districts. There are ways of dealing with this, for example having some sort of digital identifier for each district that can be traced to that of its “parent” or “child” district(s). However, to my knowledge, not only does India not have such a system, but it can also be hard to find this information even on the districts’ websites.
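To make the "parent" and "child" identifier idea concrete, here is a minimal sketch in Python. The district IDs, years, and the `LINEAGE` table are entirely hypothetical (no such official identifiers are described in this post); the point is only that once each split is recorded as a child-to-parent link, tracing a present-day district back to its original district(s) becomes a short recursive lookup.

```python
from collections import defaultdict

# Hypothetical lineage records: (child_id, parent_id, year_created).
# These IDs and years are invented for illustration only.
LINEAGE = [
    ("D3", "D1", 1995),  # D3 carved out of D1
    ("D4", "D1", 2005),  # D4 carved out of parts of D1 ...
    ("D4", "D2", 2005),  # ... and parts of D2
    ("D5", "D3", 2010),  # D5 carved out of D3
]


def build_parent_index(records):
    """Map each child district ID to the set of its direct parents."""
    parents = defaultdict(set)
    for child, parent, _year in records:
        parents[child].add(parent)
    return parents


def root_districts(district_id, parents):
    """Return the set of original districts that district_id descends from."""
    if district_id not in parents:
        # No recorded parent: this district is itself an original one.
        return {district_id}
    roots = set()
    for parent in parents[district_id]:
        roots |= root_districts(parent, parents)
    return roots


parent_index = build_parent_index(LINEAGE)
```

With such a table, data for `D5` and `D3` could both be aggregated back to `D1` before comparing across decades, while `D4` would need to be handled as a mix of `D1` and `D2`.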

It is of course relatively simple to use spatial analysis software to figure out the overlaps, but this does not let us off the hook completely. We still need to manually check that the matches are correct, and in some cases they are not, mainly because the mismatched districts were small and the map boundaries were slightly different. I believe the main reason for this is the lack of access to “official” digital boundary maps from across the years. After having gone through the painstaking process of matching new and old districts, I feel no one else should have to go through it again, and I will soon make these “matching data” available. (Please bear with me as I try to figure out the best way to do this.)
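One way to narrow down the manual checking is to flag suspicious overlaps automatically. The sketch below assumes the overlap areas have already been computed by spatial analysis software and exported as (new district, old district, overlap area) rows; the function name `find_slivers` and the 5% threshold are illustrative assumptions, not the procedure actually used here. It flags pairs where only a small share of a new district falls in an old one, since these are the likely products of slightly mismatched map boundaries.

```python
from collections import defaultdict


def find_slivers(overlaps, threshold=0.05):
    """Flag (new, old) pairs that cover only a small share of the new district.

    overlaps: iterable of (new_district, old_district, overlap_area) tuples.
    Returns a list of (new_district, old_district, share) with share < threshold.
    """
    overlaps = list(overlaps)

    # Total overlapped area per new district, used as the denominator.
    totals = defaultdict(float)
    for new, _old, area in overlaps:
        totals[new] += area

    flagged = []
    for new, old, area in overlaps:
        if totals[new] == 0:
            continue  # nothing to compare against
        share = area / totals[new]
        if share < threshold:
            flagged.append((new, old, share))
    return flagged
```

Flagged pairs still need a human decision: a sliver may be a boundary artifact, or it may be a genuine small transfer of territory.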

Another issue that comes up with Indian data is inconsistent district names. Transliterating names from the multitude of languages spoken there to English (the language for most national datasets) is understandably dicey. For example, the district in West Bengal could reasonably be spelled Darjeeling or Darjiling in English (the official spelling is Darjeeling). Layer on the intentional changing of spellings from the old Anglicized versions to ones that are perceived as being more true to the local name (e.g. Cuddapah to Kadapa in Andhra Pradesh). Then add names that were changed for any number of reasons, including honoring someone, e.g. Kadapa is now (also known as) Y. S. Rajasekhara Reddy district, for a former chief minister of the state, and the former Nawanshahr in Punjab is now Shahid Bhagat Singh Nagar, for the freedom fighter. Now, add inconsistent rules on abbreviations (YSR district or Y.S. Rajasekhara Reddy district) and other aspects of terminology (e.g. North Goa or Goa (North), South 24 Parganas or South Twenty-four Parganas, Leh (Ladakh) or Ladakh (Leh)), and what results is a seemingly endless set of possible names for a given place.
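A common way to tame this kind of variation is to normalize each name, look it up in a hand-maintained alias table, and fall back to fuzzy matching only for what is left. The sketch below uses the examples from the paragraph above; which spelling to treat as canonical (for instance Kadapa rather than Y. S. Rajasekhara Reddy) is an arbitrary choice made for illustration, as are the function names and the 0.85 similarity cutoff.

```python
import difflib
import re

# Canonical names chosen for this illustration only.
CANONICAL = [
    "Darjeeling",
    "Kadapa",
    "Shahid Bhagat Singh Nagar",
    "North Goa",
    "South 24 Parganas",
]


def normalize(name):
    """Lowercase, replace punctuation with spaces, and drop the word 'district'."""
    tokens = re.sub(r"[^a-z0-9]+", " ", name.lower()).split()
    return " ".join(t for t in tokens if t != "district")


# Known variants, keyed by their normalized form.
ALIASES = {
    "darjiling": "Darjeeling",
    "cuddapah": "Kadapa",
    "ysr": "Kadapa",
    "y s rajasekhara reddy": "Kadapa",
    "nawanshahr": "Shahid Bhagat Singh Nagar",
    "goa north": "North Goa",
    "south twenty four parganas": "South 24 Parganas",
}

BY_KEY = {normalize(c): c for c in CANONICAL}


def canonicalize(name, cutoff=0.85):
    """Return the canonical district name, or None if no confident match."""
    key = normalize(name)
    if key in ALIASES:
        return ALIASES[key]
    if key in BY_KEY:
        return BY_KEY[key]
    # Fuzzy fallback for misspellings that are not in the alias table.
    close = difflib.get_close_matches(key, list(BY_KEY), n=1, cutoff=cutoff)
    return BY_KEY[close[0]] if close else None
```

Renamings such as Nawanshahr to Shahid Bhagat Singh Nagar share almost no letters, so fuzzy matching cannot find them; they have to live in the alias table. Names that return `None` are the ones left for manual review.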

If this were not enough, there is the problem of districts with similar, or in some cases the same, names in different states (see Figure 2). We have a couple of Aurangabads, Bijapurs, Bilaspurs, Hamirpurs, Pratapgarhs, and Raigarhs (although in fairness, the one in Maharashtra is usually spelled Raigad, and other countries also have this issue, particularly the US, where there are a bewildering 31 Washington counties). I find this variety of names amusing, and indeed endearing, but it makes little sense that each person who needs to combine different datasets should be spending hours on this name-matching game. Again, this is a problem that is easily solved by having digital identifiers for districts, or quality control for the use of official place names in datasets, or ideally both.
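In the absence of identifiers, the practical workaround for duplicate names is to join datasets on the (state, district) pair rather than on the district name alone. Here is a minimal sketch; the two small tables, their column names, and all numeric values are placeholders invented for illustration, not real data.

```python
# Placeholder datasets: the numeric values are invented for illustration.
rainfall = [
    {"state": "Bihar", "district": "Aurangabad", "rain_mm": 1.0},
    {"state": "Maharashtra", "district": "Aurangabad", "rain_mm": 2.0},
]
crop_yield = [
    {"state": "Maharashtra", "district": "Aurangabad", "yield_t_ha": 20.0},
    {"state": "Bihar", "district": "Aurangabad", "yield_t_ha": 10.0},
]


def join_on_state_district(left, right):
    """Inner-join two lists of row dicts on the (state, district) pair."""
    index = {(row["state"], row["district"]): row for row in right}
    merged = []
    for row in left:
        match = index.get((row["state"], row["district"]))
        if match is not None:
            # Combine the columns of both rows into a single record.
            merged.append({**match, **row})
    return merged


merged = join_on_state_district(rainfall, crop_yield)
```

Had the index been keyed on the district name alone, the second Aurangabad would silently overwrite the first, and both rainfall rows would be paired with the same yield figure.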

The need for better data is a hot topic in many research circles now, but along with the collection and availability of trustworthy data that is adequately representative for the questions we want to answer, we need the data to be in formats that are easy to use and match across multiple datasets. This will become increasingly important as we enlist insights and data from multiple disciplines in tackling the challenges facing society.

Have you ever faced similar problems with respect to Indian data? I would love to hear about how you tackled them. Feel free to contact me directly at ans62@cornell.edu.

P.S.: Speaking of spelling, I have used the American versions here since we are in the US. I really had to go against my grain to say spelled (spelt), learned (learnt, although this one may be more debatable) and worst of all, reorganized (reorganised).

P.P.S.: After writing this post, I came across a fantastic website that addresses many of the issues I raised above, most importantly the one of which new district came from which old district(s). It also has an extensive listing of spelling variations. The website is not limited to India, so if you come across a similar problem anywhere in the world, I recommend checking it out.