A comprehensive beginner-to-advanced reference covering installation on Windows (32-bit & 64-bit), macOS, Linux, and Termux Android — plus packages, visualization, statistics, data structures, and machine learning modeling.
Keywords: R programming language, R for data science, install R on Windows 32-bit 64-bit, install R on macOS, install R on Linux, install R on Termux Android, RStudio tutorial, CRAN packages, R vs Python, data visualization in R, statistical modeling in R, ggplot2, dplyr, R advantages and disadvantages, hierarchical clustering R, PCA in R, regression analysis R
Table of Contents
- What Is R? — The Language of Data Science
- Why Learn R? Key Reasons & Use Cases
- R vs. Python — Full Comparison
- Install R on Windows (32-bit & 64-bit)
- Install R on macOS (Intel & Apple Silicon)
- Install R on Linux (All Distros)
- Install R on Termux (Android)
- Installing RStudio — The Premier IDE
- Understanding the R & RStudio Interface
- R Packages — Your Superpowers
- Graphics & Data Visualization in R
- Basic Statistics in R
- Data Types & Structures in R
- Entering & Importing Data
- Statistical Modeling in R
- Advantages of R Programming
- Disadvantages of R Programming
- Common Errors, Resources & Next Steps
1. What Is R? — The Language of Data Science
R is a free, open-source programming language and statistical computing environment designed from the ground up for data analysis, statistical modeling, and graphical visualization. Created in the early 1990s by Ross Ihaka and Robert Gentleman at the University of Auckland, New Zealand, R has grown into the world’s most trusted language for data science, academic research, and quantitative analysis.
Unlike general-purpose languages such as Python, Java, or C++, R was purpose-built with statisticians and data analysts in mind. Every core feature — from its vector-first data model to its formula syntax for statistical models — reflects deep consideration for how people actually work with data. A single line of R can express a complex statistical idea that would take dozens of lines in other languages.
In a landmark survey of data mining and data science professionals, R ranked #1 as the most frequently used tool — with usage approximately 50% higher than Python, the next most popular option. R’s dominance in pure statistics and academic research is undisputed.
R powers thousands of peer-reviewed research papers every year, clinical trial analyses submitted to the FDA, financial risk models at major banks, genome sequencing pipelines in bioinformatics labs, and award-winning data journalism at major news organizations. If you work with data seriously, understanding R is foundational.
As R user Simon Blomberg famously put it: “This is R. There is no if. Only how.” With over 18,000 packages available on CRAN alone, R almost certainly already has a solution for whatever you need to do with data.
2. Why Learn R? Key Reasons & Use Cases
Core Reasons to Learn R
- Completely Free: R is licensed under the GNU General Public License. Commercial alternatives like SAS ($9,000+/year), SPSS ($6,000+/year), and MATLAB ($2,000+/year) are prohibitively expensive for individuals and students.
- Open Source Transparency: Every function can be inspected, audited, and modified — critical for scientific reproducibility and integrity.
- Vectorized Operations: R processes entire arrays of data without explicit loops. Operations requiring for-loops in other languages are often single-line expressions in R.
- Latest Statistical Methods: New statistical methods appear in R packages within days of their academic publication — often authored by the researchers who invented them.
- 18,000+ Packages on CRAN: Every conceivable statistical and data science task has a purpose-built package: Bayesian inference, spatial analysis, text mining, network analysis, clinical trials, and much more.
- Publication-Quality Graphics: R, particularly through ggplot2, produces visualizations used by The New York Times, The Economist, the BBC, and FiveThirtyEight.
- Reproducible Research: R Markdown and Quarto combine code, results, and narrative in a single document rendered to HTML, PDF, Word, or slides.
- Massive Community: Millions of R users worldwide. R User Groups (RUGs) in hundreds of cities. Active Stack Overflow community. Weekly #TidyTuesday challenges on social media.
- Interoperability: R reads data from Excel, CSV, JSON, SQL, SPSS, Stata, SAS, APIs, and dozens of other formats — the universal data hub.
- Industry Standard in Academia: R is the de facto statistical language for research at universities, hospitals, government agencies, and pharmaceutical companies worldwide.
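To make the vectorization point concrete, here is a minimal sketch that converts a few temperatures from Celsius to Fahrenheit twice: once with an explicit loop and once as a single vectorized expression. The `temps_c` values are made up for illustration.

```r
# Hypothetical temperature readings in Celsius (illustrative values only)
temps_c <- c(12.5, 18.0, 21.3, 30.1)

# Loop version: fill a preallocated vector one element at a time
temps_f_loop <- numeric(length(temps_c))
for (i in seq_along(temps_c)) {
  temps_f_loop[i] <- temps_c[i] * 9 / 5 + 32
}

# Vectorized version: the arithmetic is applied to every element at once
temps_f_vec <- temps_c * 9 / 5 + 32

# Both approaches produce the same numbers
all.equal(temps_f_loop, temps_f_vec)
```

The vectorized form is shorter and usually faster, because the element-by-element work happens in compiled code rather than in the R interpreter.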
R Use Cases by Domain
| Domain | How R Is Used | Key Packages |
|---|---|---|
| Finance & Banking | Portfolio optimization, risk modeling, algorithmic trading, time series forecasting | quantmod, PerformanceAnalytics, fPortfolio |
| Healthcare & Pharma | Clinical trials, drug efficacy, survival analysis, FDA regulatory submissions | survival, lme4, Hmisc, rms |
| Academic Research | Peer-reviewed analysis, meta-analysis, reproducible research pipelines | psych, metafor, lavaan |
| Marketing & Business | Customer segmentation, A/B testing, churn modeling, marketing mix modeling | caret, tidymodels, mlogit |
| Bioinformatics | Genomics, proteomics, single-cell RNA-seq, pathway analysis | Bioconductor, DESeq2, Seurat, ape |
| Government & Policy | Census analysis, public health surveillance, policy evaluation | survey, srvyr, tidycensus |
| Data Journalism | Data-driven reporting, interactive charts, election analysis | ggplot2, plotly, leaflet, htmlwidgets |
| Machine Learning | Classification, regression, clustering, neural networks | caret, tidymodels, xgboost, randomForest |
| Environmental Science | Climate modeling, ecological statistics, spatial analysis | raster, terra, vegan, spatstat |
| Sports Analytics | Player performance, injury prediction, game strategy optimization | baseballr, nflreadr, StatsBombR |
3. R vs. Python — Full Comparison
Both R and Python are excellent tools for data science. Many professionals use both. Here is a comprehensive, honest comparison:
| Feature | R | Python |
|---|---|---|
| Primary Design Purpose | Statistical computing & data analysis | General-purpose programming |
| Syntax Philosophy | Declarative, formula-based, vector-first | Object-oriented, imperative, explicit |
| Learning Curve | Easier if you have a statistics background | Easier if you have a programming background |
| Data Visualization | Outstanding — ggplot2 is industry-leading | Good — matplotlib, seaborn, plotly |
| Statistical Modeling | Excellent — native and deeply integrated | Good — via scikit-learn, statsmodels |
| Machine / Deep Learning | Good — caret, tidymodels, keras | Dominant — TensorFlow, PyTorch |
| Web Development | Limited — Shiny for dashboards only | Full-stack — Django, Flask, FastAPI |
| Package Repository | CRAN: 18,000+ (quality-controlled) | PyPI: 400,000+ (open submission) |
| Best IDE | RStudio — purpose-built, outstanding | VS Code, PyCharm, Jupyter Notebook |
| Community Focus | Academic, statistics, biomedical, research | Industry, engineering, AI/ML startups |
| Job Market | Strong in research, biostatistics, pharma | Dominant in software industry & AI |
| Reproducibility Tools | R Markdown, Quarto (excellent) | Jupyter Notebooks (widely used) |
Verdict: Choose R if your work centers on statistics, clinical research, academic publication, bioinformatics, or advanced data visualization. Choose Python for production software, automation, or deep learning at scale. For serious data scientists, learning both is the ideal path.
4. Install R on Windows (32-bit & 64-bit)
R 4.2.0 and later support only 64-bit Windows. Older versions (4.1.x and earlier) shipped both 32-bit and 64-bit builds, which could be installed side by side. The official download source is:
Official URL: https://cloud.r-project.org
Step-by-Step: Windows Installation
- Open your browser and go to https://cloud.r-project.org
- Click “Download R for Windows”
- Click “base” — the standard installer for most users
- Click the download link (e.g., R-4.x.x-win.exe)
- Right-click the downloaded file and choose “Run as administrator”
- Accept the license agreement and click Next
- Keep the default installation directory: C:\Program Files\R\R-4.x.x
- Select components: R 4.2 and later offer only 64-bit files. In R 4.1 and earlier, keep 64-bit Files (default and recommended) and add 32-bit only if you need legacy package compatibility.
- Accept all remaining defaults and click Finish
- Verify: open Command Prompt and type R --version. If the command is not recognized, R's bin folder is not on your PATH; launch R from the Start menu instead, or call R.exe by its full path.
32-bit vs. 64-bit — Complete Comparison
| Property | 32-bit (i386) | 64-bit (x64) — Recommended |
|---|---|---|
| Maximum RAM addressable | ~3 GB (hard ceiling) | Limited only by physical RAM (terabytes) |
| Native DLL compatibility | Works with 32-bit DLLs | Cannot load 32-bit native DLLs |
| Performance on 64-bit OS | Degraded (runs via WOW64 emulation layer) | Full native performance |
| Legacy package support | Required for some older packages | All modern packages fully supported |
| Windows compatibility | Windows XP SP2 and later | Windows 7 SP1 and later |
| Available in R 4.2+ | No — removed from R 4.2 onward | Yes — the only option in R 4.2+ |
| Executable path (under the R install folder) | bin\i386\R.exe | bin\x64\R.exe |
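If you are unsure which build you are running, R can report it. This is a small sketch using only base R; the variable names are illustrative.

```r
# Architecture string reported by the running R build (for example "x86_64")
arch <- R.version$arch

# Pointer size in bytes: 8 on a 64-bit build, 4 on a 32-bit build
pointer_bytes <- .Machine$sizeof.pointer
bits <- pointer_bytes * 8

cat("R architecture:", arch, "\n")
cat("This is a", bits, "bit build of R\n")
```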
Installing Rtools (Required for Package Compilation)
Many R packages contain C, C++, or Fortran code that must be compiled. On Windows this requires Rtools — a separate download from CRAN.
- Download Rtools from: https://cran.r-project.org/bin/windows/Rtools/
- Choose the version matching your R version (e.g., Rtools44 for R 4.4.x)
- Run the installer and accept all defaults
- Verify inside R: Sys.which("make") should return the path to make.exe
Windows file path tip: Use forward slashes in R file paths: "C:/Users/You/data.csv" instead of "C:\Users\You\data.csv".
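The reason behind this tip is that a backslash starts an escape sequence inside an R string, so a path typed with single backslashes is misread or rejected. The sketch below shows a few equivalent ways to write the same path; the path itself is the illustrative one from the tip and does not need to exist.

```r
# Forward slashes work on every platform, including Windows
path_forward <- "C:/Users/You/data.csv"

# Backslashes must be doubled, because a single "\" begins an escape sequence
path_escaped <- "C:\\Users\\You\\data.csv"

# file.path() joins components with "/" so no separator is typed by hand
path_built <- file.path("C:", "Users", "You", "data.csv")
identical(path_forward, path_built)

# R 4.0+ raw strings keep backslashes literal, handy for pasted Windows paths
path_raw <- r"(C:\Users\You\data.csv)"
identical(path_raw, path_escaped)
```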
# Verify R installation in Command Prompt
R --version
# Verify Rtools is working (run inside R)
Sys.which("make")
5. Install R on macOS (Intel & Apple Silicon)
R provides native macOS packages for both Intel (x86_64) and Apple Silicon (ARM64) Macs. Since R 4.1, dedicated Apple Silicon builds run natively on M1, M2, and M3 chips for significantly better performance.
Identify Your Mac’s Chip
- Click the Apple menu → About This Mac
- “Chip: Apple M1/M2/M3” = Apple Silicon → download the arm64 package
- “Processor: Intel Core i5/i7/i9” = Intel Mac → download the x86_64 package
Step-by-Step: macOS Installation
- Go to https://cloud.r-project.org and click “Download R for macOS”
- Download the correct package:
  - Apple Silicon (M1/M2/M3): R-4.x.x-arm64.pkg
  - Intel Mac: R-4.x.x-x86_64.pkg
- Double-click the downloaded .pkg file to launch the installer
- Follow the steps: Introduction → Read Me → License → Install
- Enter your macOS administrator password when prompted
- Click Close when complete
- Verify: open Terminal and type R --version
Additional Required Tools
# Install Xcode Command Line Tools (required for compiling R packages)
xcode-select --install
# Install Homebrew (recommended for managing system dependencies)
/bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"
# Install common libraries needed by R packages
brew install libxml2 openssl@3 curl libjpeg libpng
# Verify R works
R --version
Additional notes:
- XQuartz: Some graphics functions need XQuartz. Download from xquartz.org
- GFortran: Required for packages with Fortran code (like lme4). Download from https://cran.r-project.org/bin/macosx/tools/
- Homebrew alternative: brew install --cask r works, but the CRAN .pkg is more stable
6. Install R on Linux (All Major Distributions)
Linux offers the most flexible R installation. CRAN maintains official repositories for all major distributions providing the latest R versions — more up-to-date than default system repositories.
Ubuntu & Debian (apt-based)
# Step 1: Install prerequisites
sudo apt update
sudo apt install -y dirmngr gnupg apt-transport-https \
ca-certificates software-properties-common wget
# Step 2: Add CRAN GPG signing key
wget -qO- https://cloud.r-project.org/bin/linux/ubuntu/marutter_pubkey.asc \
| sudo gpg --dearmor -o /usr/share/keyrings/r-project.gpg
# Step 3: Add CRAN repository
# Replace "jammy" with your Ubuntu codename
# Run: lsb_release -cs to find your codename
# focal = 20.04 | jammy = 22.04 | noble = 24.04
echo "deb [signed-by=/usr/share/keyrings/r-project.gpg] \
https://cloud.r-project.org/bin/linux/ubuntu jammy-cran40/" \
| sudo tee /etc/apt/sources.list.d/r-project.list
# Step 4: Update and install R
sudo apt update
sudo apt install -y r-base r-base-dev
# Step 5: Install common system libraries for R packages
sudo apt install -y \
libcurl4-openssl-dev libssl-dev libxml2-dev \
libfontconfig1-dev libharfbuzz-dev libfribidi-dev \
libfreetype6-dev libpng-dev libtiff-dev libjpeg-dev
# Step 6: Verify
R --version
Fedora / RHEL / CentOS / AlmaLinux / Rocky Linux
# Fedora
sudo dnf install R R-devel
# RHEL 8/9, AlmaLinux, Rocky Linux — enable EPEL first
sudo dnf install epel-release
sudo dnf install R R-devel
# CentOS 7 (uses yum)
sudo yum install epel-release
sudo yum install R
# Development libraries for package compilation
sudo dnf install libcurl-devel openssl-devel libxml2-devel
# Verify
R --version
Arch Linux & Manjaro
# Install R from official Arch repositories
sudo pacman -S r
# Install base development tools
sudo pacman -S base-devel
# Verify
R --version
openSUSE
# openSUSE Leap 15.x
sudo zypper addrepo \
https://download.opensuse.org/repositories/devel:languages:R:base/openSUSE_Leap_15.5/ \
CRAN-R-base
sudo zypper refresh
sudo zypper install R-base R-base-devel
Running R from the Linux Terminal
# Start interactive R session
R
# Run an R script non-interactively
Rscript my_analysis.R
# Run R code directly from the terminal
Rscript -e "summary(iris)"
# Install a package as system administrator (available to all users)
sudo Rscript -e "install.packages('ggplot2', repos='https://cloud.r-project.org')"
7. Install R on Termux (Android)
Termux is a terminal emulator and Linux environment for Android that lets you run R directly on your phone or tablet. This gives you a portable, pocket-sized R environment — ideal for learning, quick analyses, and working without a laptop.
Prerequisites
- Android 7.0 (Nougat) or later — Android 9.0+ recommended
- Termux installed from F-Droid (strongly recommended — the Google Play version is outdated and broken). Visit f-droid.org
- At least 2 GB free internal storage
- 3 GB+ RAM recommended; 4–8 GB for comfortable use
- Internet connection for initial package download
Important: Always install Termux from F-Droid, not Google Play. The Google Play version has not been updated since 2020 and is broken for most use cases. You may need to enable “Install from unknown sources” in Android security settings to install F-Droid.
Step-by-Step: R on Termux
# Step 1: Update Termux packages
pkg update && pkg upgrade -y
# Step 2: Install system dependencies
# (Termux packages include their headers, so there are no separate -dev packages)
pkg install -y wget curl git openssl libxml2
# Step 3: Install R
pkg install -y r-base
# Step 4: Launch R
R
# Step 5: Inside R, install packages
install.packages('ggplot2')
install.packages(c('dplyr', 'tidyr', 'readr'))
Installing R Packages on Termux
# Install compilers for source package compilation
pkg install -y clang make cmake
# Grant access to Android device files
termux-setup-storage
# Install pre-compiled CRAN packages (faster — no compilation)
pkg install r-cran-ggplot2
pkg install r-cran-dplyr
# Or install from CRAN inside R
R
install.packages('psych')
install.packages('rio')
RStudio Server on Termux (Advanced)
# Install proot-distro and a full Ubuntu environment
pkg install -y proot-distro
proot-distro install ubuntu
proot-distro login ubuntu
# Inside Ubuntu: install R and RStudio Server
apt update && apt install -y r-base r-base-dev
# Download RStudio Server for your architecture
# Check https://posit.co/download/rstudio-server/ for current URL
wget https://download2.rstudio.org/server/jammy/arm64/rstudio-server-YYYY.MM.x-arm64.deb
dpkg -i rstudio-server-*.deb
rstudio-server start
# Open browser on your phone and go to: http://localhost:8787
Termux Tips
- Persistent sessions: Install tmux (pkg install tmux) to keep R running when Termux is backgrounded by Android
- Extra keys: Swipe from the left edge of Termux to open the drawer, then long-press the Keyboard button to toggle the extra keys row (Tab, Ctrl, Alt, arrows)
- Bluetooth keyboard: Dramatically improves productivity when writing R scripts on a phone or tablet
- File access: After termux-setup-storage, your files are accessible at ~/storage/shared/
- Performance: Android devices with 4–8 GB RAM run R comfortably in Termux
8. Installing RStudio — The Premier R IDE
RStudio is the Integrated Development Environment (IDE) purpose-built for R. It transforms the R experience into a cohesive, professional development environment that dramatically boosts productivity. Virtually every professional R programmer uses RStudio.
Download URL: https://posit.co/download/rstudio-desktop/ — Free Community Edition. R must be installed first.
What RStudio Provides
- Four-Pane Layout: Script editor, console, environment/history, and files/plots/help in one organized window
- Consistent Shortcuts: Same keyboard shortcuts across Windows, macOS, and Linux
- Syntax Highlighting & Code Completion: Color-coded R code, bracket matching, auto-complete for function names and arguments
- Integrated Debugger: Set breakpoints, step through code, inspect variables at any point
- Git/GitHub Integration: Full version control within the IDE — commit, push, pull without leaving RStudio
- R Markdown & Quarto: Write reproducible documents with live preview and one-click rendering to HTML/PDF/Word
- Data Viewer: Click any data frame to open a sortable, filterable spreadsheet view
- Project System: Organize scripts, data, and outputs into self-contained Projects with automatic working directory management
Installation by Platform
Windows
- Download the Windows .exe installer from posit.co
- Run as Administrator and accept all defaults
- RStudio auto-detects your R installation
- To switch between 32-bit and 64-bit R: Tools → Global Options → General → R version → Change
macOS
- Download the .dmg installer from posit.co
- Open the .dmg and drag RStudio to your Applications folder
- On first launch, click Open if macOS asks for confirmation
Linux
# Ubuntu/Debian — always check posit.co for the current version number
wget https://download1.rstudio.org/electron/jammy/amd64/rstudio-2024.xx.x-xxx.amd64.deb
sudo dpkg -i rstudio-*.amd64.deb
sudo apt --fix-broken install # Fix any missing dependencies
rstudio & # Launch RStudio
Essential RStudio Keyboard Shortcuts
| Action | Windows / Linux | macOS |
|---|---|---|
| Run current line or selection | Ctrl+Enter | Cmd+Enter |
| Insert assignment operator (<-) | Alt+- | Option+- |
| Insert pipe operator (%>%) | Ctrl+Shift+M | Cmd+Shift+M |
| Comment / uncomment lines | Ctrl+Shift+C | Cmd+Shift+C |
| Clear console | Ctrl+L | Ctrl+L |
| New R script | Ctrl+Shift+N | Cmd+Shift+N |
| Save current file | Ctrl+S | Cmd+S |
| Source entire script | Ctrl+Shift+S | Cmd+Shift+S |
| Find and replace | Ctrl+F | Cmd+F |
| Restart R session | Ctrl+Shift+F10 | Cmd+Shift+F10 |
| Zoom Source pane | Ctrl+Shift+1 | Ctrl+Shift+1 |
| Zoom Console pane | Ctrl+Shift+2 | Ctrl+Shift+2 |
9. Understanding the R & RStudio Interface
RStudio’s Four-Pane Layout
| Pane | Default Position | Purpose |
|---|---|---|
| Source / Script Editor | Top Left | Write, edit, and save R scripts. Multiple files as tabs. Lines starting with # are comments and are not executed. |
| Console | Bottom Left | Shows output. Type R code directly for immediate execution. Displays warnings, errors, and messages. |
| Environment / History | Top Right | Shows all variables and objects in memory with types and sizes. History tab lists all previous commands. |
| Files / Plots / Help / Packages | Bottom Right | Browse files, view plots, search documentation, install and manage packages. |
Key R Syntax Fundamentals
# Lines starting with # are COMMENTS — not executed
# Use comments to document your thinking
# The assignment operator — read as "x gets the value 42"
x <- 42
name <- "R Programming"
# Call a function: functionName(argument1, argument2)
print(x)
sqrt(144)
round(3.14159, digits = 2)
# Get help on any function with ? or help()
?mean
help("lm")
# Access a column inside a data frame with $
iris$Sepal.Length # The Sepal.Length column of iris
# The native pipe operator (R 4.1+) — passes result to next function
iris |> head()
# The magrittr pipe: available once dplyr (or magrittr) is loaded
library(dplyr)
iris %>% head()
10. R Packages — Your Superpowers
Packages are bundles of R code, functions, documentation, and data that extend R far beyond its built-in capabilities. With 18,000+ packages on CRAN, whatever you need to do with data, there is almost certainly a package for it.
Base vs. Contributed Packages
| Property | Base Packages | Contributed (Third-Party) |
|---|---|---|
| Origin | Included with R automatically | Downloadable from CRAN, GitHub, Bioconductor |
| Loaded by default? | Core ones (base, stats, graphics, grDevices, utils, datasets, methods) are attached at startup; the rest need library() | No: must install once, then load each session |
| Examples | datasets, stats, graphics, utils | ggplot2, dplyr, caret, shiny, rmarkdown |
| Quality control | Maintained by R Core Team | CRAN enforces documentation and testing standards |
| Count | ~30 base packages | 18,000+ on CRAN; thousands more on GitHub |
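One way to check what is attached in your own session, and which installed packages shipped with R, is sketched below. It uses only base R functions; the exact list printed depends on your installation.

```r
# Packages attached in the current session appear on the search path
attached <- sub("^package:", "", grep("^package:", search(), value = TRUE))
print(attached)

# Core packages such as stats and utils are attached at startup
c("stats", "utils") %in% attached

# Packages that ship with R carry a Priority of "base" or "recommended"
ip <- installed.packages()
shipped <- ip[!is.na(ip[, "Priority"]), c("Package", "Priority"), drop = FALSE]
table(shipped[, "Priority"])
```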
Installing, Loading and Managing Packages
# Install a single package
install.packages('ggplot2')
# Install multiple packages at once
install.packages(c('dplyr', 'tidyr', 'stringr', 'lubridate', 'rio'))
# Install from GitHub (requires remotes package)
install.packages('remotes')
remotes::install_github('tidyverse/ggplot2')
# Load a package (required each session)
library(ggplot2) # Silent — no output if successful
require(dplyr) # Returns TRUE/FALSE — useful inside functions
# Update all installed packages
update.packages(ask = FALSE)
# Unload a package
detach('package:ggplot2', unload = TRUE)
# See all installed packages
installed.packages()[, "Package"]
Using Pacman — The Smart Package Manager
pacman checks if each package is installed, installs those that aren’t, then loads everything — all in one command. It replaces the install.packages() + library() workflow forever.
# Install pacman once
install.packages('pacman')
# p_load: installs if needed, then loads — one command does everything
pacman::p_load(dplyr, ggplot2, tidyr, stringr, lubridate, rio, psych, caret, shiny, rmarkdown)
# Unload all contributed packages
pacman::p_unload(all)
# Use a function from a package without loading the whole package
ggplot2::ggplot(mtcars, ggplot2::aes(wt, mpg)) + ggplot2::geom_point()
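For readers curious about what an "install if missing, then load" helper does, here is a rough base R sketch of the idea. The functions `missing_packages()` and `load_or_install()` are hypothetical names written for this example; they are not part of pacman and do not reproduce its internals.

```r
# Hypothetical helper: return the names of packages that are not installed
missing_packages <- function(pkgs) {
  found <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!found]
}

# Hypothetical helper: install anything missing, then attach every package
load_or_install <- function(pkgs, repos = "https://cloud.r-project.org") {
  to_install <- missing_packages(pkgs)
  if (length(to_install) > 0) {
    install.packages(to_install, repos = repos)
  }
  # require() returns TRUE or FALSE for each package it tries to attach
  invisible(vapply(pkgs, require, logical(1), character.only = TRUE))
}

# Packages that ship with R are always found, so nothing is downloaded here
missing_packages(c("stats", "utils"))
load_or_install(c("stats", "utils"))
```

Calling `load_or_install()` with contributed package names would trigger a download from the CRAN mirror given in `repos` for any that are missing.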
Top 15 Must-Have R Packages
| Package | Purpose | Install Command |
|---|---|---|
| ggplot2 | Grammar of Graphics — gold standard for data visualization. Builds layered, publication-quality charts. | install.packages('ggplot2') |
| dplyr | Fast, intuitive data frame manipulation. Five core verbs: filter, select, mutate, summarize, arrange. | install.packages('dplyr') |
| tidyr | Data cleaning and reshaping. pivot_wider(), pivot_longer(), separate(), unite(). | install.packages('tidyr') |
| stringr | Consistent string manipulation with a unified str_* function family. | install.packages('stringr') |
| lubridate | Makes date/time handling painless. Parse any date format, do date arithmetic. | install.packages('lubridate') |
| rio | Universal import/export. One import() function handles CSV, Excel, SPSS, Stata, JSON, and 40+ formats. | install.packages('rio') |
| caret | Unified interface for 200+ machine learning algorithms with cross-validation and preprocessing. | install.packages('caret') |
| shiny | Build interactive web applications from R. No HTML, CSS, or JavaScript required. | install.packages('shiny') |
| rmarkdown | Reproducible documents combining code, results, and narrative. Renders to HTML, PDF, Word. | install.packages('rmarkdown') |
| psych | Comprehensive descriptive statistics including SD, skewness, kurtosis, SE, and trimmed means. | install.packages('psych') |
| plotly | Interactive web-ready charts. Converts ggplot2 to interactive with one line: ggplotly(p). | install.packages('plotly') |
| data.table | Blazing-fast data manipulation for large datasets. Processes hundreds of millions of rows in seconds. | install.packages('data.table') |
| tidymodels | Modern, unified framework for machine learning with consistent syntax across all model types. | install.packages('tidymodels') |
| httr | HTTP requests and REST API interaction. GET, POST, handle authentication, parse JSON. | install.packages('httr') |
| pacman | Package manager. p_load() installs if missing, then loads — replaces install.packages() + library(). | install.packages('pacman') |
11. Graphics & Data Visualization in R
Data visualization is where R truly shines. The golden rule: always visualize before you analyze. Graphics reveal patterns, outliers, and anomalies that no numerical test will detect.
The Default plot() Command — Smart Adaptation
R’s built-in plot() detects data types and automatically produces the most appropriate chart:
| Input to plot() | Chart Produced | Why This Chart? |
|---|---|---|
| Single categorical variable | Bar chart | Shows distribution of categories |
| Single quantitative variable | Index plot | Shows values against row position |
| Categorical + Quantitative | Box plot per group | Compares distributions across groups |
| Quantitative + Quantitative | Scatter plot | Shows association between two measures |
| Entire data frame | Scatter plot matrix | All pairwise relationships at once |
| Mathematical formula | Line graph | Visualizes the function’s curve |
library(datasets)
data(iris)
# Bar chart — categorical variable
plot(iris$Species)
# Box plot — compare groups
plot(iris$Species, iris$Petal.Length)
# Scatter plot — two quantitative variables
plot(iris$Petal.Length, iris$Petal.Width)
# Full scatter plot matrix — entire data frame
plot(iris)
# Polished customized scatter plot
plot(iris$Petal.Length, iris$Petal.Width,
col = "#CC0000", # Hex color
pch = 19, # Solid circle point
cex = 1.5, # 150% point size
main = "Iris Petal Dimensions",
xlab = "Petal Length (cm)",
ylab = "Petal Width (cm)")
# Add a regression line to an existing plot
abline(lm(Petal.Width ~ Petal.Length, data = iris), col = "blue", lwd = 2)
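Two rows of the plot() table are not covered by the listing above: the index plot for a single quantitative variable and the line graph for a mathematical function. A short sketch of both, using the same iris data and base R functions:

```r
# Index plot: a single quantitative variable plotted against row position
plot(iris$Sepal.Length)

# Line graph of a mathematical function over a chosen range
plot(sin, from = 0, to = 2 * pi, main = "sin(x) from 0 to 2*pi")

# Any one-argument function of x works, including one defined inline
plot(function(x) x^2 - 3 * x, from = -2, to = 5)
```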
Bar Charts
data(mtcars)
# IMPORTANT: barplot() needs a frequency table, not raw data!
cylinders <- table(mtcars$cyl) # Create frequency table first
barplot(cylinders,
main = "Cars by Cylinder Count (Motor Trend 1974)",
xlab = "Number of Cylinders",
ylab = "Count",
col = c("#3498DB", "#E74C3C", "#2ECC71"),
border = "white")
Histograms
Key things to look for in a histogram: shape (symmetric vs. skewed), gaps (possible subgroups), outliers, and modality (one peak vs. two or more).
# Basic histogram
hist(iris$Sepal.Length)
# Customized histogram
hist(iris$Sepal.Length,
breaks = 12,
col = "steelblue",
border = "white",
freq = FALSE, # Show density, not count
main = "Sepal Length Distribution",
xlab = "Sepal Length (cm)")
# Overlay a normal distribution curve
curve(dnorm(x, mean = mean(iris$Sepal.Length), sd = sd(iris$Sepal.Length)),
col = "red", lwd = 2, add = TRUE)
# Small multiples — histograms by group
par(mfrow = c(3, 1)) # 3 rows, 1 column layout
hist(iris$Petal.Length[iris$Species == "setosa"],
main = "Setosa — Petal Length", col = "red", xlim = c(0, 8))
hist(iris$Petal.Length[iris$Species == "versicolor"],
main = "Versicolor — Petal Length", col = "purple", xlim = c(0, 8))
hist(iris$Petal.Length[iris$Species == "virginica"],
main = "Virginica — Petal Length", col = "blue", xlim = c(0, 8))
par(mfrow = c(1, 1)) # Always reset layout after
Overlaying Multiple Charts
data(lynx) # Canadian lynx trappings 1821-1934
# Base histogram (density scale)
hist(lynx, breaks = 14, freq = FALSE, col = "thistle1",
main = "Canadian Lynx Trappings 1821-1934")
# Overlay normal distribution
curve(dnorm(x, mean(lynx), sd(lynx)),
col = "darkmagenta", lwd = 2, add = TRUE)
# Overlay kernel density estimator
lines(density(lynx), col = "blue", lwd = 2)
# Overlay rug plot (individual observations as tick marks)
rug(lynx, col = "gray50", lwd = 2)
12. Basic Statistics in R
After visualizing your data, you need numerical precision. R’s statistical functions are concise, accurate, and deeply integrated with its data structures.
library(datasets)
data(iris)
# Summary of a categorical variable — returns frequency counts
summary(iris$Species)
# setosa:50 versicolor:50 virginica:50
# Summary of a quantitative variable — five-number summary + mean
summary(iris$Sepal.Length)
# Min. 1st Qu. Median Mean 3rd Qu. Max.
# 4.300 5.100 5.800 5.843 6.400 7.900
# Summary of entire data frame at once
summary(iris)
# Individual statistics
mean(iris$Sepal.Length) # Arithmetic mean
median(iris$Sepal.Length) # Median (robust to outliers)
sd(iris$Sepal.Length) # Standard deviation
var(iris$Sepal.Length) # Variance
range(iris$Sepal.Length) # Min and Max
quantile(iris$Sepal.Length) # All quartiles
IQR(iris$Sepal.Length) # Interquartile range
cor(iris[, 1:4]) # Correlation matrix
# describe() from the psych package — much more detail
library(psych)
describe(iris$Sepal.Length) # n, mean, sd, median, skew, kurtosis, se
describe(iris[, 1:4]) # All quantitative columns
# Subsetting cases
iris[iris$Species == "setosa", ] # All setosa rows
# Multiple conditions
iris[iris$Species == "virginica" & iris$Petal.Length < 5.5, ]
# Save a subset as a new data frame
iris_setosa <- iris[iris$Species == "setosa", ]
# Using dplyr for elegant subsetting
library(dplyr)
iris |>
filter(Species == "versicolor", Petal.Length > 4) |>
select(Species, Petal.Length, Petal.Width) |>
summarize(mean_length = mean(Petal.Length),
sd_length = sd(Petal.Length))
13. Data Types & Structures in R
Data Types
| Type | Description | Example | Check With |
|---|---|---|---|
| numeric (double) | Decimal numbers; R’s default for any number | 3.14, -0.5 | is.numeric() |
| integer | Whole numbers; append L to create | 1L, 42L | is.integer() |
| character | Text strings; always quoted | "hello", 'R' | is.character() |
| logical | Boolean TRUE or FALSE | TRUE, FALSE, T, F | is.logical() |
| complex | Complex numbers | 3+2i | is.complex() |
| raw | Raw bytes; rarely used directly | as.raw(0x1F) | is.raw() |
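The types in the table can be checked directly at the console. This short sketch uses `typeof()`, which reports R's internal storage type, alongside a few of the `is.*()` checks listed above.

```r
typeof(3.14)        # "double"
typeof(42)          # "double": a bare number is not an integer
typeof(42L)         # "integer"
typeof("hello")     # "character"
typeof(TRUE)        # "logical"
typeof(3+2i)        # "complex"
typeof(as.raw(31))  # "raw"

# is.numeric() is TRUE for both doubles and integers
is.numeric(42L)     # TRUE
is.integer(42)      # FALSE, because 42 is stored as a double

# class() reports "numeric" for a double
class(3.14)         # "numeric"
```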
Data Structures
| Structure | Dimensions | Same Type Required? | Key Use |
|---|---|---|---|
| vector | 1D | Yes | R’s fundamental object. All scalars are length-1 vectors. |
| matrix | 2D | Yes | Numerical computation, linear algebra. |
| array | 3D+ | Yes | Multi-dimensional data (images, tensors). |
| data.frame | 2D | No — columns can differ | The spreadsheet of R. Most analyses start here. |
| list | Any | No — anything goes | Flexible container. Function outputs are often lists. |
| factor | 1D | Yes — integer codes | Categorical data with fixed levels. Essential for modeling. |
# Vector — one-dimensional, all same type
v1 <- c(10, 20, 30, 40)
v2 <- c("apple", "banana", "cherry")
# Matrix — 2D, all same type
m1 <- matrix(1:9, nrow = 3, ncol = 3, byrow = TRUE)
# Data Frame — 2D, mixed types per column
df1 <- data.frame(
name = c("Alice", "Bob", "Carol"),
age = c(25, 30, 28),
passed = c(TRUE, FALSE, TRUE),
stringsAsFactors = FALSE
)
# Factor — categorical with fixed levels
os <- factor(c("Windows", "macOS", "Linux", "Windows"))
levels(os) # Shows all unique levels
table(os) # Frequency count per level
# Coercion — convert between types
as.integer(3.9) # Returns 3 (truncates, not rounds)
as.numeric("42") # Returns 42
as.character(100) # Returns "100"
as.logical(0) # Returns FALSE
as.data.frame(m1) # Matrix to data frame
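The listing above leaves out two structures from the table: lists and arrays. The following sketch fills that gap; the `person` values are made up for illustration, and the model fit is only there to show that function results are often lists.

```r
# List: elements can have different types and lengths
person <- list(name = "Alice", scores = c(90, 85, 77), passed = TRUE)
person$name             # access an element by name with $
person[["scores"]][2]   # second score, extracted with [[ ]]

# Many functions return lists; a fitted model is one example
fit <- lm(Petal.Width ~ Petal.Length, data = iris)
is.list(fit)
names(fit)[1:3]

# Array: one type, three or more dimensions (here 2 x 3 x 4)
a1 <- array(1:24, dim = c(2, 3, 4))
dim(a1)
a1[2, 3, 4]             # the last element, 24
```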
14. Entering & Importing Data
Entering Data Manually
# Assignment operator — read as "x gets the value"
x <- 5
name <- "Hello R"
# In RStudio: Alt+- (Windows) or Option+- (Mac) inserts <-
# Colon operator: sequential integers
0:10 # 0 1 2 3 4 5 6 7 8 9 10
10:0 # 10 9 8 7 6 5 4 3 2 1 0
# seq(): flexible sequences
seq(0, 100, by = 5) # 0 5 10 15 ... 100
seq(1, 0, length.out = 11) # 11 evenly spaced from 1 to 0
# c(): combine values into a vector
ages <- c(23, 45, 31, 67, 29)
first_names <- c("Ali", "Sara", "Zain", "Priya") # avoid calling a variable "names": it shadows the base function names()
# rep(): repeat values
rep(TRUE, 5) # TRUE TRUE TRUE TRUE TRUE
rep(c("A", "B"), times = 3) # A B A B A B
rep(c("A", "B"), each = 3) # A A A B B B
# scan(): interactive data entry (type values then press Enter twice)
scores <- scan()
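Because scan() with no arguments waits for keyboard input, it is awkward inside scripts. A possible non-interactive variant, sketched here with made-up values, passes the numbers through the text argument and then combines manually entered vectors into a data frame.

```r
# scan(text = ...) parses values from a string instead of the keyboard
scores <- scan(text = "88 92 79 65", quiet = TRUE)
scores                   # 88 92 79 65
mean(scores)             # 81

# Build a small data frame from manually entered vectors
# (student names and scores are invented for this example)
students <- data.frame(
  student = c("Ali", "Sara", "Zain", "Priya"),
  score   = scores
)
nrow(students)           # 4
```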
Importing External Data
| Format | Extension | rio Command | Base R Command |
|---|---|---|---|
| CSV | .csv | import("file.csv") | read.csv("file.csv") |
| Tab-delimited | .txt | import("file.txt") | read.table("file.txt", sep="\t") |
| Excel | .xlsx | import("file.xlsx") | Requires readxl package |
| SPSS | .sav | import("file.sav") | Requires haven package |
| Stata | .dta | import("file.dta") | Requires haven package |
| JSON | .json | import("file.json") | Requires jsonlite package |
| R native | .rds | import("file.rds") | readRDS("file.rds") |
library(rio)
# Import any format — rio detects it automatically
data <- import("my_data.csv")
data <- import("my_data.xlsx")
data <- import("my_data.sav")
# Export to any format — even convert between formats
export(data, "output.xlsx") # CSV to Excel
export(data, "output.csv") # Excel to CSV
# Inspect imported data
str(data) # Structure: types and first values
head(data) # First 6 rows
dim(data) # Rows x columns
names(data) # Column names
View(data) # Spreadsheet viewer in RStudio
Excel Import Warning: R’s official documentation advises against importing Excel files directly when it can be avoided. Export from Excel as CSV first, then import the CSV. This avoids problems with merged cells, hidden rows, formatting, and date encoding. The rio package can read Excel files (base R has no built-in Excel reader), but CSV is usually the most reliable format.
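For readers who prefer not to install rio, the same round trip can be sketched with base R alone. The example below writes to temporary files so it does not depend on any existing data; the data frame contents are invented.

```r
# Write a small data frame to a temporary CSV, then read it back
df <- data.frame(id = 1:3, city = c("Lahore", "Oslo", "Lima"))
csv_path <- tempfile(fileext = ".csv")
write.csv(df, csv_path, row.names = FALSE)
df_csv <- read.csv(csv_path)
str(df_csv)              # id is integer, city is character

# RDS preserves R-specific attributes such as factor levels
df$city <- factor(df$city)
rds_path <- tempfile(fileext = ".rds")
saveRDS(df, rds_path)
df_rds <- readRDS(rds_path)
identical(df, df_rds)    # TRUE
```

CSV is the better choice for exchanging data with other tools, while RDS is convenient for saving intermediate R objects without losing type information.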
15. Statistical Modeling in R
Statistical modeling is R’s deepest strength. The language’s formula notation, unified model objects, and rich ecosystem of modeling packages make it one of the most expressive environments for quantitative analysis.
Hierarchical Clustering — Finding Similar Cases
Hierarchical clustering groups observations by similarity, building a tree (dendrogram). It is ideal when you don’t know how many clusters exist in advance. Key decisions: distance metric (Euclidean, Manhattan), linkage method (complete, average, Ward’s), and divisive vs. agglomerative approach.
library(datasets)
# The native pipe |> used below is built into R 4.1 and later; no extra package is needed
data(mtcars)
# Select meaningful numeric variables
cars <- mtcars[, c(1:4, 6:7, 9:11)]
# Compute hierarchical clustering using pipes
hc <- cars |>
dist() |> # Step 1: Compute pairwise Euclidean distances
hclust() # Step 2: Agglomerative hierarchical clustering
# Plot the dendrogram
plot(hc,
main = "Car Similarity Dendrogram — Motor Trend 1974",
xlab = "",
sub = "",
hang = -1) # Align all labels at the bottom
# Draw colored boxes around k clusters
rect.hclust(hc, k = 2, border = "gray")
rect.hclust(hc, k = 3, border = "blue")
rect.hclust(hc, k = 5, border = "darkred")
# Get cluster membership for each observation
clusters <- cutree(hc, k = 4)
print(clusters)
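The example above relies on the defaults (Euclidean distance, complete linkage) and on unscaled variables, so columns with large values such as disp and hp carry most of the weight. The sketch below makes the distance and linkage choices explicit after standardizing the data. The particular combinations and k = 3 are illustrative assumptions, not recommendations from the original text.

```r
# Standardize so that large-valued columns do not dominate the distances
cars_scaled <- scale(mtcars[, c(1:4, 6:7, 9:11)])

# Choice A: Euclidean distance with Ward's linkage
hc_ward <- hclust(dist(cars_scaled, method = "euclidean"),
                  method = "ward.D2")

# Choice B: Manhattan distance with average linkage
hc_avg <- hclust(dist(cars_scaled, method = "manhattan"),
                 method = "average")

# Cross-tabulate the 3-cluster memberships from both solutions
members_ward <- cutree(hc_ward, k = 3)
members_avg  <- cutree(hc_avg, k = 3)
table(ward = members_ward, average = members_avg)
```

If most cars fall on the diagonal of the table (up to relabeling), the two settings broadly agree; large off-diagonal counts suggest the cluster structure is sensitive to these decisions.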
Principal Component Analysis (PCA) — Dimensionality Reduction
PCA transforms many correlated variables into fewer uncorrelated principal components while retaining as much variation as possible. Think of it as casting a shadow: a 3D object becomes 2D but still conveys the essential shape. Use PCA to reduce noise, identify key dimensions, and visualize high-dimensional data in 2D.
data(mtcars)
cars_sub <- mtcars[, c(1:7, 9:11)]
# Compute PCA
pc <- prcomp(cars_sub,
center = TRUE, # Shift means to zero
scale. = TRUE) # Scale each variable to SD = 1 (the argument name ends with a dot)
# How much variance does each component explain?
summary(pc)
# Scree plot: visually shows importance of each component
plot(pc, type = "l", main = "Scree Plot")
# Biplot: shows both variables (arrows) and cases (labels)
biplot(pc, main = "PCA Biplot — 1974 Cars", cex = 0.7)
# Loadings: contribution of each original variable to each PC
pc$rotation
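The proportions reported by summary(pc) can also be computed by hand, which makes it easy to decide how many components to keep. This is a minimal sketch; the 90% threshold is an arbitrary assumption chosen for illustration.

```r
pc <- prcomp(mtcars[, c(1:7, 9:11)], center = TRUE, scale. = TRUE)

# The variance of each component is its squared standard deviation
var_explained <- pc$sdev^2 / sum(pc$sdev^2)
round(cumsum(var_explained), 3)

# Smallest number of components that retains at least 90% of the variance
n_keep <- which(cumsum(var_explained) >= 0.90)[1]

# Scores: the original observations projected onto the kept components
scores <- pc$x[, 1:n_keep, drop = FALSE]
dim(scores)
```

The scores matrix can then stand in for the original variables in later steps such as plotting or clustering.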
Regression Analysis
Regression is among the most widely used statistical methods in data science. Linear regression predicts a continuous outcome from one or more predictor variables, and R’s lm() function fits it through a clean formula interface.
library(datasets)
data(USJudgeRatings)
# Prepare predictors (X) and outcome (Y)
X <- as.matrix(USJudgeRatings[, -12]) # All columns except Retention
Y <- USJudgeRatings[, 12] # Retention: should judge keep job?
# Fit linear regression
model <- lm(Y ~ X)
# Full summary: coefficients, SE, t-values, p-values, R-squared
summary(model)
# ANOVA table
anova(model)
# 95% confidence intervals for all coefficients
confint(model)
# Diagnostic plots (4 panels)
par(mfrow = c(2, 2))
plot(model)
par(mfrow = c(1, 1))
# Residuals histogram
hist(residuals(model), main = "Residuals Distribution", col = "steelblue")
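An alternative way to write the same model is to pass the data frame to lm() directly, which keeps the original column names on the coefficients and makes prediction simpler. The sketch below also fits a smaller two-predictor model; the choice of INTG and DMNR and the values for the new judge are made up for illustration.

```r
# Same full model, fitted from the data frame: "." means all other columns
model_df <- lm(RTEN ~ ., data = USJudgeRatings)
coef(model_df)

# A smaller model with two predictors, used here for prediction
model_small <- lm(RTEN ~ INTG + DMNR, data = USJudgeRatings)

# Predict for a hypothetical judge (ratings invented for this example)
new_judge <- data.frame(INTG = 8.0, DMNR = 7.5)
predict(model_small, newdata = new_judge, interval = "confidence")
```

predict() returns the fitted value together with the lower and upper bounds of the 95% confidence interval for the mean response.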
Regression Methods Available in R
| Method | Function / Package | When to Use |
|---|---|---|
| Simple Linear | lm(y ~ x) | One predictor, continuous outcome |
| Multiple Linear | lm(y ~ x1 + x2 + x3) | Multiple predictors, continuous outcome |
| Stepwise | step(model) | Automatic variable selection (AIC-based) |
| Ridge | glmnet(alpha=0) | High multicollinearity between predictors |
| Lasso | glmnet(alpha=1) | Variable selection + coefficient shrinkage |
| Elastic Net | glmnet(alpha=0.5) | Blend of Ridge and Lasso |
| Logistic | glm(family=binomial) | Binary outcome (yes/no, 0/1) |
| Poisson | glm(family=poisson) | Count outcomes (number of events) |
| Mixed Effects | lme4::lmer() | Nested or repeated-measures data |
| Random Forest | randomForest::randomForest() | Non-linear, robust, high-performance |
| Gradient Boosting | xgboost::xgboost() | Competition-level predictive modeling |
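Two of the methods in the table, logistic and Poisson regression, need only base R. The following sketch fits both on mtcars; the choice of variables is purely illustrative and is not meant as a substantive model of the data.

```r
# Logistic regression: probability of a manual transmission (am = 1)
# as a function of car weight
logit_model <- glm(am ~ wt, data = mtcars, family = binomial)
summary(logit_model)$coefficients

# Predicted probability for a hypothetical car with wt = 3 (3,000 lb)
p_manual <- predict(logit_model, newdata = data.frame(wt = 3),
                    type = "response")
p_manual

# Poisson regression: carburetor count modeled from horsepower
pois_model <- glm(carb ~ hp, data = mtcars, family = poisson)
exp(coef(pois_model))   # multiplicative effect on the expected count
```

Both calls reuse the same formula interface as lm(); only the family argument changes.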
16. Advantages of R Programming
Cost and Licensing
- Completely Free: Licensed under the GNU GPL. Commercial alternatives (SAS, SPSS, MATLAB) cost $2,000–$15,000+ per user per year. R is and always will be free.
- Open Source Transparency: Every function is inspectable and auditable. Critical for scientific reproducibility — you can verify exactly what any statistical function does.
- No Vendor Lock-In: Your code is fully portable. You are never dependent on a single company’s licensing decisions or product discontinuation.
Statistical Power
- Purpose-Built for Statistics: Every aspect of R’s design reflects its statistical purpose, from the formula notation y ~ x to how data frames handle missing values.
- Access to Latest Methods: New statistical methods appear in R packages often on the same day as their academic publication. No other language has this proximity to cutting-edge research.
- Bioconductor Ecosystem: Over 2,000 packages specifically for bioinformatics — genomics, proteomics, single-cell analysis. No equivalent in any other language.
- Comprehensive Coverage: Classical statistics, Bayesian methods, machine learning, survival analysis, spatial analysis, time series, psychometrics, econometrics — R has purpose-built packages for all of it.
Visualization Excellence
- ggplot2 Grammar of Graphics: The most elegant and powerful visualization system in any programming language. Builds complex charts from simple, composable layers with consistent, learnable syntax.
- Publication-Quality Output: The New York Times, The Economist, BBC, and FiveThirtyEight all use R-generated graphics.
- Interactive Visualizations: Shiny, plotly, leaflet, and highcharter create browser-based interactive dashboards with no JavaScript required.
- Infinite Customization: Every pixel — colors, fonts, margins, tick marks, legends, annotations — can be precisely controlled.
Reproducibility and Reporting
- R Markdown: A single .Rmd file contains code, results, and narrative — rendered to HTML, PDF, Word, or slides. Makes science truly reproducible.
- Quarto: Next-generation scientific publishing system supporting R and Python. Produces reports, books, websites, and dashboards.
- Shiny: Deploy fully interactive data apps to the web without a web development team.
Performance and Scalability
- Vectorization: Operations over entire vectors and matrices run in compiled C or Fortran code, so they are far faster than element-by-element loops written in R.
- data.table: Processes billions of rows in seconds. The fastest in-memory data manipulation tool across any language for grouped aggregation.
- sparklyr / arrow: Seamless integration with Apache Spark and Apache Arrow for big-data workflows exceeding RAM.
- Rcpp: Write performance-critical code in C++ and call it from R with zero-friction interoperability.
- Parallel Computing: The parallel, future, and foreach packages make multi-core processing straightforward.
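The vectorization point in the list above can be checked with a small experiment. This sketch squares the same values once with an explicit loop and once with a single vectorized call; the vector length and the seed are arbitrary choices, and the timings will vary by machine.

```r
set.seed(1)                      # seed chosen arbitrarily
x <- runif(1e5)

# Loop version: every iteration is evaluated by the R interpreter
loop_sq <- numeric(length(x))    # preallocate the result vector
for (i in seq_along(x)) {
  loop_sq[i] <- x[i]^2
}

# Vectorized version: one call, the loop runs in compiled code
vec_sq <- x^2

# Both approaches give the same values
all.equal(loop_sq, vec_sq)

# Rough timing comparison
system.time(for (i in seq_along(x)) loop_sq[i] <- x[i]^2)
system.time(vec_sq <- x^2)
```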
Community and Ecosystem
- 18,000+ CRAN Packages: Every package passes quality checks for documentation, examples, and platform compatibility — distinguishing CRAN from PyPI.
- Global R User Groups (RUGs): In-person meetups in hundreds of cities worldwide.
- useR! Conference: Annual international conference with talks available free on YouTube.
- TidyTuesday: Weekly community data visualization challenge — an excellent source of learning examples.
17. Disadvantages of R Programming
Learning Challenges
- Syntax Inconsistency: Base R and the Tidyverse have different syntax philosophies — learners must navigate both worlds.
- Multiple Ways to Do the Same Thing: Base R, data.table, dplyr, and others solve the same problems differently, which confuses beginners.
- Cryptic Error Messages: R’s error messages are often unhelpful and hard for beginners to interpret.
- Requires Statistical Literacy: Using R without understanding statistical assumptions can produce misleading results.
- Inconsistent Function Arguments: Function argument names are not standardized across packages.
Performance Limitations
- RAM-Intensive: R loads all data into memory by default. Datasets larger than available RAM require special packages (ff, bigmemory, arrow) or database backends.
- Single-Threaded by Default: Parallelism requires explicit setup using the parallel, future, or doParallel packages.
- Slow Loops: Pure R for-loops are 100–1000x slower than equivalent C code. Always vectorize or use apply functions.
- Memory Fragmentation: R’s copy-on-modify semantics can cause unexpected memory doubling with large datasets.
Production and Deployment
- Not General-Purpose: R cannot easily build web servers, mobile apps, or desktop applications.
- Limited Web Development: Shiny covers dashboards but cannot build full production web services.
- Deployment Complexity: Production deployment requires plumber (for APIs), Docker (for containerization), and Posit Connect (for enterprise).
- Cannot Build Native Apps: No mobile or desktop application frameworks exist for R.
Package and Ecosystem Issues
- Variable Package Quality: Some CRAN packages are abandoned, poorly documented, or untested.
- Breaking Changes: Popular packages (ggplot2, dplyr) occasionally introduce changes that break existing code.
- Dependency Conflicts: Installing packages can cause version conflicts between their dependencies.
- Narrower Job Market: Python dominates in tech companies. R’s job market is concentrated in academia, pharma, and research.
Summary: R excels when your work centers on statistical analysis, clinical research, academic publication, bioinformatics, or advanced data visualization. For general-purpose software engineering, automation, or deep learning at scale, Python or other languages may serve you better.
18. Common Errors, Resources & Next Steps
Common Errors and How to Fix Them
| Error Message | Cause | Fix |
|---|---|---|
| could not find function | Package not loaded in current session | Add library(packagename) at the top of your script |
| object 'x' not found | Variable doesn’t exist or is misspelled (R is case-sensitive) | Check spelling. Run ls() to see what objects exist. |
| subscript out of bounds | Accessing an index that doesn’t exist | Check length(x) or dim(df) before indexing |
| argument is of length zero | Passing an empty vector or data frame | Guard with if(length(x) > 0) before the operation |
| non-conformable arguments | Matrix dimensions don’t match for the operation | Check dim() on both objects; ensure rows/columns align |
| NAs introduced by coercion | as.numeric() failed on non-numeric characters | Check data with table(is.na(x)); clean before converting |
| Package install failure | Missing system libraries (Linux) or Rtools (Windows) | Install dev libraries on Linux; install Rtools on Windows (see Sections 4 and 6) |
| R hangs / infinite loop | Code stuck in a loop or waiting for input | Press Escape in RStudio or Ctrl+C in terminal |
| figure margins too large | Plot window too small for the chart | Make the plot pane larger, or reset with par(mar=c(5,4,4,2)) |
| unused argument (xyz) | Invalid argument passed to a function | Check ?functionname for valid argument names |
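Some of these problems can be handled in code rather than fixed by hand. The sketch below shows one possible pattern for the coercion warning and for recovering from an error with tryCatch(). The helper name safe_numeric is hypothetical and not part of base R or any package.

```r
# Convert to numeric and report which entries failed,
# instead of only getting "NAs introduced by coercion"
safe_numeric <- function(x) {          # hypothetical helper name
  out <- suppressWarnings(as.numeric(x))
  bad <- is.na(out) & !is.na(x)        # NA created by the conversion itself
  if (any(bad)) {
    message("Could not convert: ", paste(x[bad], collapse = ", "))
  }
  out
}
safe_numeric(c("10", "2.5", "abc"))

# tryCatch(): recover from an error instead of stopping the script
result <- tryCatch(
  log(-1 + "a"),                       # fails: non-numeric argument
  error = function(e) {
    message("Caught error: ", conditionMessage(e))
    NA_real_                           # fallback value
  }
)
result
```

Returning NA from the error handler lets a longer script continue, while the message still records what went wrong.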
Essential Learning Resources
- Official R Documentation: cran.r-project.org — manuals, package docs, task views organized by topic
- “R for Data Science” (free online): r4ds.hadley.nz — the definitive Tidyverse guide by Hadley Wickham
- Posit Community Forum: community.rstudio.com — friendly Q&A for R and RStudio
- Stack Overflow R tag: Thousands of answered questions — search your exact error message first
- YouTube — StatQuest with Josh Starmer: The clearest explanations of statistics and R on the internet
- Coursera — Johns Hopkins Data Science Specialization: The most popular R certificate sequence; financial aid available
- TidyTuesday: Weekly community dataset challenge — excellent for building a data visualization portfolio
- R-bloggers: r-bloggers.com — aggregator of R tutorials and news from the community
Your R Learning Roadmap
- Foundation (Week 1–2): Install R and RStudio. Learn basic syntax, data types, vectors, and data frames.
- Tidyverse Core (Week 3–4): Master dplyr for data manipulation and ggplot2 for visualization. These two packages cover 80% of day-to-day work.
- Data Import and Cleaning (Week 5): Practice importing CSV, Excel, and other formats. Learn tidyr for reshaping messy data.
- Statistics and Modeling (Week 6–8): Linear regression with lm(), descriptive statistics with psych, and basic hypothesis tests.
- Reproducible Reporting (Week 9): Learn R Markdown. Create a report combining analysis code and narrative in a single document.
- Machine Learning (Week 10–12): Explore tidymodels for classification and regression. Practice cross-validation, tuning, and model evaluation.
- Shiny Dashboards (Month 4+): Build your first interactive application. Deploy it free on shinyapps.io.
- Community Participation: Post a TidyTuesday entry. Answer a Stack Overflow question. Write a blog post about something you built with R.
Conclusion
R is not merely a programming language — it is the language of data science. For over three decades, R has empowered researchers, analysts, clinicians, and data scientists to extract deep meaning from data with unmatched statistical sophistication and visual elegance.
Whether you are running R on a Windows 64-bit workstation, an Apple Silicon Mac, a Linux server, or your Android phone via Termux — R is ready, free, and fully capable. Its 18,000+ packages, world-class visualization system, and deeply welcoming community ensure that whatever you need to do with data, R can do it.
The journey from your first plot(iris) to deploying a production Shiny dashboard is shorter than you think. Start today.
“In R, there is no ‘if’, only ‘how.’”