Author: Fru Kingsly
Publisher: Leanpub
Year: 2021-01-21
Pages: 1206
Language: English
Format: pdf, epub
Size: 13.2 MB
The goal of this book is to give beginner and intermediate R users an opportunity to learn practically all there is to know about data wrangling, cleansing and exploration. Before data can be visualised or used in analysis and reporting, it needs to be in the right shape and of the right type. This book helps you achieve that without having to search the internet at every step. It will teach you programming with R, plotting with base graphics and ggplot2, importing data from diverse sources, manipulating dates, text and data frames, and dealing with outliers, missing data and duplicates.
Data wrangling is one of the most important steps in data science and analytics; it is often claimed to take between 80% and 90% of an analyst’s time. Data wrangling goes by many names, including data munging, data manipulation, data preparation and data transformation. Just as there are many names for data wrangling, there are also many definitions of it. Below we look at two of the most important ones:
Trifacta, a leading provider of data wrangling software of the same name, defines data wrangling as:
“Data wrangling is the process of cleaning, structuring and enriching raw data into a desired format for better decision making in less time”.
Gartner, which uses the term data preparation, defines it as:
“Data preparation is an iterative-agile process for exploring, combining, cleaning, and transforming raw data into curated datasets for self-service data integration, data science, data discovery, and BI/analytics”.
From the above, we can deduce that data wrangling is the process of converting raw data from one form into another that is appropriate for the task at hand. It is rare in analytics to receive data in the form and shape that our analysis requires. Most often, we need to transform, clean, enrich and explore the data before moving on to the analysis.
Data wrangling involves:
Importing and exporting data: to and from CSV, Excel, databases, etc.
Cleaning data: identifying and dealing with missing data, outliers, and duplicates
Manipulating text and categorical data
Manipulating dates
Encoding and enriching data
Manipulating columns and rows
Split-apply-combine data
Merging data
Reshaping data
Grouping and Aggregating data
Exploring data
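A few of the steps listed above can be sketched in base R. This is only a minimal illustration on a made-up data frame: the `sales` data, its column names and its values are hypothetical and are not taken from the book.

```r
# Hypothetical raw data: one duplicate row, one missing value,
# and dates stored as text
sales <- data.frame(
  region = c("north", "south", "south", "north", "east"),
  date   = c("2021-01-05", "2021-01-06", "2021-01-06",
             "2021-01-07", "2021-01-08"),
  amount = c(120, 80, 80, NA, 200),
  stringsAsFactors = FALSE
)

# Cleaning: drop exact duplicate rows, then rows with a missing amount
clean <- sales[!duplicated(sales), ]
clean <- clean[!is.na(clean$amount), ]

# Manipulating dates and categorical data: convert text to proper types
clean$date   <- as.Date(clean$date)
clean$region <- factor(clean$region)

# Grouping and aggregating: total amount per region
totals <- aggregate(amount ~ region, data = clean, FUN = sum)
print(totals)
```

Dropping rows with missing values is just one possible choice here; depending on the task, imputing or flagging them may be more appropriate.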
Data exploration is, and should be, the initial step of any data analysis project. It is a small-scale form of data analysis in which we use both descriptive statistics and data visualization techniques to better understand our dataset. With traditional analysis and research, we know exactly what we are after (that is, the hypothesis is known) before collecting data. With exploratory analysis, the process is reversed: we assume little or nothing about the outcome of the analysis and instead explore the data to arrive at meaningful insights or hypotheses. Data exploration involves:
looking at the structure and size of the data
looking at the completeness and correctness of the data
looking at the possible relationships that may exist between data elements
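The three checks above can be sketched with a handful of base R functions. The data frame `df` and its columns are hypothetical, used only for illustration; this is one possible first pass, not the book's own procedure.

```r
# Hypothetical data frame used only for illustration
df <- data.frame(
  height = c(150, 160, 170, NA, 180),
  weight = c(50, 58, 69, 75, 82)
)

# Structure and size of the data
str(df)
dim(df)

# Completeness: number of missing values per column
missing <- colSums(is.na(df))
print(missing)

# Correctness: summary statistics help spot implausible values
summary(df)

# Possible relationships: correlation computed on complete rows only
r <- cor(df$height, df$weight, use = "complete.obs")
print(r)
```

Using `use = "complete.obs"` keeps the missing height from turning the correlation into `NA`.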
As can be observed, the boundary between data exploration and data wrangling is blurred, because both use data cleaning techniques to make sure that the data is correct and complete for analysis.
Who is this book for:
This book is for anybody interested in understanding and applying data manipulation and transformation techniques with R.
How is this book structured:
The book is divided into seven parts:
Part 1: Programming with R (chapters 1 to 15)
Part 2: Import and export data (chapters 16 to 18)
Part 3: String and categorical data manipulation (chapters 19 to 21)
Part 4: Date manipulation (chapters 22 to 24)
Part 5: Data manipulation (chapters 25 to 28)
Part 6: Data cleaning (chapters 29 to 30)
Part 7: Data exploration (chapters 31 to 32)
Download Effective Data Wrangling and Exploration with R