Author: Martin Elff
Publisher: SAGE Publications
Year: 2021
Pages: 351
Language: English
Format: pdf, epub, mobi
Size: 10.2 MB
An invaluable, step-by-step guide to data management in R for social science researchers.
This book will show you how to recode data, combine data from different sources, document data, and import data from statistical packages other than R. It explores both qualitative and quantitative data and is packed with a range of supportive learning features such as code examples, overview boxes, images, tables, and diagrams.
The free and open source software package R is perhaps mainly known for the quality of graphics that one can create with it and for the availability of many of the most advanced techniques of data analysis. But to create high-quality graphics or to conduct state-of-the-art statistical analyses, one first needs to prepare the data. Yet data preparation can be a challenge because data are not always ‘ready for use’. The first challenge one may face is to actually ‘get the data into’ R, that is, to make the software understand the format in which the data are arranged in a computer file and to load the data from the file. The second challenge is to rearrange the data, once loaded into R memory, in such a way that graphics can be produced and data analyses can be conducted. This second challenge has two sources: first, the implementation of many data-analytical techniques is based on very specific assumptions about how the data are structured; second, the data one wants to analyse may be only a subset of the available data or, conversely, may be distributed across several data objects and data files. Dealing with these challenges is the substance of data management and the topic of this book.
This book gives an overview of the main kinds of data that social scientists may encounter in their research and how they can work with these different kinds using R. The present chapter reviews the basic data concepts relevant for working with R, such as objects, variables, and workspaces, as well as packages, data files, and scripts. The second chapter discusses the building blocks and concepts from which almost all other data structures in this book are built: vectors, lists, attributes, and classes. The third chapter discusses the management of data frames, the kind of data object that corresponds to what many researchers who have worked with other software have the most experience with: data arranged in rows and columns, where rows correspond to – depending on the terminology of the discipline and the software being used – observations, cases, objects, units, or individuals, and columns correspond to variables that contain measurements. It also discusses how such data can be imported into R. For those readers who have already worked with R and merely want to expand their knowledge about how to tackle data management issues or how to work with some unusual types of data, much, if not all, of what is discussed in the first three chapters will seem quite familiar. Nevertheless, a book on data management would be incomplete without a discussion of these topics, and that discussion may serve as a useful point of reference in later chapters. Further, introductory texts on data analysis with R do not give much room to these topics, so reviewing them in this book may be worthwhile.
The fourth chapter discusses extensions to data frames that are provided by the extension package data.table and the ‘Tidyverse’ collection of extension packages. They provide a re-implementation of the data management functions discussed in Chapter 3, with the aim of increasing their efficiency. These packages appear to be highly popular among people engaged in data science, but they are not indispensable for the kind of data management that social scientists are usually concerned with. The focus of this chapter is therefore a comparison of the approaches of these packages with what is already available from a standard installation of R.
Chapter 5 discusses how a serious limitation of the standard installation of R can be addressed: its limited support for features that researchers who work with social science surveys and with commercial packages such as IBM SPSS and Stata can take for granted, namely variable labels, value labels, and missing value declarations. This limitation can be overcome with the help of the extension package memisc, which has been created and is maintained by the author of this book.
Download Data Management in R: A Guide for Social Scientists