PyTorch Tensors Visual Guide: From Zero to Neural Networks in 30 Minutes
By: Evgeny Padezhnov
PyTorch powers everything from Tesla's self-driving to ChatGPT. But everyone gets stuck on tensors first. I spent two weeks debugging shape errors when I started. Here's what I wish someone showed me on day one.
This guide shows tensor operations visually. You'll understand shapes, operations, and GPU acceleration. Real code examples from production projects included.
What Are Tensors, Really?
Tensors are just multi-dimensional arrays. Think of an Excel spreadsheet that can have more than two dimensions. PyTorch documentation calls them "the central data abstraction in PyTorch."
Here's the hierarchy:
- 0D tensor = scalar (single number)
- 1D tensor = vector (list of numbers)
- 2D tensor = matrix (table of numbers)
- 3D+ tensor = tensor (cube or hypercube of numbers)
import torch
# 0D - scalar
scalar = torch.tensor(42)
# 1D - vector (5 elements)
vector = torch.tensor([1, 2, 3, 4, 5])
# 2D - matrix (3x3)
matrix = torch.tensor([[1, 2, 3],
                       [4, 5, 6],
                       [7, 8, 9]])
# 3D - tensor (2x3x4)
tensor_3d = torch.randn(2, 3, 4)
I use this mental model: spreadsheet with layers. First dimension = number of sheets. Second = rows. Third = columns.
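The spreadsheet-with-layers model is easy to verify directly. A minimal sketch (the variable names are mine):

```python
import torch

# 2 sheets, 3 rows, 4 columns
t = torch.arange(24).reshape(2, 3, 4)

sheet = t[0]        # first "sheet": shape [3, 4]
row = t[0, 1]       # second row of the first sheet: shape [4]
cell = t[0, 1, 2]   # a single cell

print(sheet.shape)  # torch.Size([3, 4])
print(row.shape)    # torch.Size([4])
print(cell.item())  # 6
```

Each index you add peels off one dimension, which is why `t[0]` is a 2D table and `t[0, 1, 2]` is a scalar.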
PyTorch tensors beat NumPy arrays for three reasons. GitHub's Tensors-101 guide nails it:
- GPU acceleration - operations orders of magnitude faster on large workloads
- Automatic differentiation - gradients calculated automatically
- Deep learning ecosystem - built-in losses, optimizers, layers
In practice, I switched from NumPy to PyTorch even for non-neural tasks. The GPU speedup on matrix operations is insane.
Default data type is 32-bit float (torch.float32). Takes 4 bytes per number. For class labels use torch.int64 (8 bytes). For mixed precision training - torch.float16 (2 bytes).
# Default float32
x = torch.tensor([1.0, 2.0, 3.0])
print(x.dtype) # torch.float32
# Explicit types
integers = torch.tensor([1, 2, 3], dtype=torch.int64)
half_precision = torch.tensor([1.0, 2.0], dtype=torch.float16)
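You can verify those byte counts yourself with element_size(). A quick sketch:

```python
import torch

x32 = torch.tensor([1.0, 2.0, 3.0])                  # default float32
x64 = torch.tensor([1, 2, 3], dtype=torch.int64)     # class labels
x16 = torch.tensor([1.0, 2.0], dtype=torch.float16)  # mixed precision

for name, t in [("float32", x32), ("int64", x64), ("float16", x16)]:
    # element_size() is bytes per element; nelement() is the element count
    total = t.element_size() * t.nelement()
    print(f"{name}: {t.element_size()} bytes/element, {total} bytes total")
```

float32 prints 4 bytes per element, int64 prints 8, float16 prints 2 - matching the numbers above.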
Creating Tensors: Four Methods That Matter
PyTorch Essentials shows multiple initialization approaches. Here are the ones I actually use daily.
Method 1: From Python lists
# Simple list to tensor
data = [[1, 2, 3], [4, 5, 6]]
tensor = torch.tensor(data)
# From NumPy (super common)
import numpy as np
numpy_array = np.array([[1, 2], [3, 4]])
tensor_from_numpy = torch.from_numpy(numpy_array)
Method 2: Pre-filled tensors
# Zeros - for initialization
zeros = torch.zeros(3, 4) # 3x4 matrix of zeros
# Ones - for masks
ones = torch.ones(2, 3, 4) # 2x3x4 tensor of ones
# Random - for weights initialization
random = torch.rand(5, 5) # values between 0 and 1
Method 3: Like existing tensor
x = torch.tensor([[1, 2], [3, 4]])
# Same shape, different values
zeros_like_x = torch.zeros_like(x)
ones_like_x = torch.ones_like(x)
random_like_x = torch.rand_like(x)
Method 4: Special initializations
# Identity matrix
eye = torch.eye(3) # 3x3 identity
# Range
range_tensor = torch.arange(0, 10, 2) # [0, 2, 4, 6, 8]
# Linspace
linear = torch.linspace(0, 1, 5) # 5 points from 0 to 1
torch.empty() creates uninitialized tensors. Looks random but it's just garbage from memory. Never use for actual calculations - debugging nightmare.
# DON'T DO THIS
bad = torch.empty(2, 3) # Contains random memory garbage
# DO THIS
good = torch.zeros(2, 3) # Properly initialized
Pro tip from debugging sessions: always print tensor shapes after creation. Saves hours of "RuntimeError: shape mismatch" hunting.
Tensor Shapes: The #1 Source of Bugs
"Shape is the single most important property of a tensor. Almost every bug in deep learning is, at its root, a shape mismatch." - straight from Tensors-101 repository. Hundred percent true.
Standard conventions by data type:
- Tabular data: (batch, features)
- Images: (batch, channels, height, width)
- Text/NLP: (batch, sequence_length, embedding_dim)
# Check shape
tensor = torch.randn(64, 3, 224, 224) # Batch of 64 RGB images
print(tensor.shape) # torch.Size([64, 3, 224, 224])
print(tensor.size()) # Same thing
# Individual dimensions
print(tensor.shape[0]) # 64 (batch size)
print(tensor.shape[1]) # 3 (channels)
Reshaping operations I use every project:
# Original tensor
x = torch.randn(4, 3, 2)
# Reshape (total elements must match)
reshaped = x.reshape(6, 4) # 4*3*2 = 24 = 6*4
# View (like reshape, but requires contiguous memory and never copies)
viewed = x.view(12, 2)
# Flatten for fully connected layers
flattened = x.flatten() # Shape: [24]
batch_flattened = x.flatten(start_dim=1) # Keep batch: [4, 6]
# Squeeze removes dimensions of size 1
y = torch.randn(1, 3, 1, 5)
squeezed = y.squeeze() # Shape: [3, 5]
# Unsqueeze adds dimension of size 1
z = torch.randn(3, 5)
unsqueezed = z.unsqueeze(0) # Shape: [1, 3, 5]
unsqueezed = z.unsqueeze(-1) # Shape: [3, 5, 1]
Common shape debugging pattern:
def debug_shapes(name, tensor):
    print(f"{name}: shape={tensor.shape}, dtype={tensor.dtype}, device={tensor.device}")
# Use everywhere during development
x = torch.randn(32, 10)
debug_shapes("input", x)
model_output = model(x)
debug_shapes("output", model_output)
Shape mismatches happen most often here:
- Batch size mismatch between layers
- Forgetting to flatten before fully connected
- Wrong channel order (channels-first vs channels-last)
- Broadcasting rules confusion
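The forgotten-flatten case is worth seeing once. A minimal sketch (the layer sizes here are made up for illustration):

```python
import torch
import torch.nn as nn

images = torch.randn(8, 3, 32, 32)        # batch of 8 small RGB images
fc = nn.Linear(3 * 32 * 32, 10)           # expects a [batch, 3072] input

# The classic failure - Linear operates on the last dimension (32 != 3072):
# fc(images)  # RuntimeError: mat1 and mat2 shapes cannot be multiplied

logits = fc(images.flatten(start_dim=1))  # Shape: [8, 10]
print(logits.shape)
```

Flattening everything after the batch dimension turns [8, 3, 32, 32] into [8, 3072], which is what the linear layer expects.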
My debugging checklist:
- Print shapes before and after each operation
- Use assert for expected shapes in critical places
- Keep batch dimension consistent (always first)
- Document expected shapes in comments
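The assert item from that checklist can look like this in practice (the helper name is mine, not a PyTorch API):

```python
import torch

def assert_shape(tensor, expected, name="tensor"):
    """Fail fast with a readable message when shapes drift."""
    assert tuple(tensor.shape) == tuple(expected), \
        f"{name}: expected shape {tuple(expected)}, got {tuple(tensor.shape)}"

x = torch.randn(32, 3, 224, 224)
assert_shape(x, (32, 3, 224, 224), "batch")           # passes silently

flat = x.flatten(start_dim=1)
assert_shape(flat, (32, 3 * 224 * 224), "flattened")  # passes
```

A wrong expectation fails immediately with both shapes in the message, instead of surfacing three layers later as a cryptic matmul error.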
GPU Operations: Speed That Changes Everything
Codegenes guide mentions device flexibility. Here's what actually matters for speed.
First, check GPU availability:
# Is CUDA available?
print(torch.cuda.is_available()) # True/False
# How many GPUs?
print(torch.cuda.device_count())
# Current GPU name
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))
Moving tensors between devices:
# Create on CPU
cpu_tensor = torch.randn(1000, 1000)
# Move to GPU (if available)
if torch.cuda.is_available():
    gpu_tensor = cpu_tensor.cuda()
    # Or more explicit
    gpu_tensor = cpu_tensor.to('cuda')
    # Or to specific GPU
    gpu_tensor = cpu_tensor.to('cuda:0')
    # Back to CPU
    back_to_cpu = gpu_tensor.cpu()
Better pattern - device agnostic code:
# Set device once
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f"Using device: {device}")
# Create directly on correct device
x = torch.randn(100, 100, device=device)
y = torch.randn(100, 100, device=device)
# Operations stay on same device
z = x @ y # Matrix multiplication on GPU
Golden rule from experience: both tensors must be on same device. This crashes:
# DON'T DO THIS
cpu_tensor = torch.randn(10, 10)
gpu_tensor = torch.randn(10, 10).cuda()
# result = cpu_tensor + gpu_tensor # RuntimeError!
# DO THIS
result = cpu_tensor.cuda() + gpu_tensor
Speed comparison on my RTX 3090:
import time
size = 5000
cpu_a = torch.randn(size, size)
cpu_b = torch.randn(size, size)
# CPU timing
start = time.time()
cpu_result = cpu_a @ cpu_b
cpu_time = time.time() - start
# GPU timing
gpu_a = cpu_a.cuda()
gpu_b = cpu_b.cuda()
torch.cuda.synchronize() # Wait for transfer
start = time.time()
gpu_result = gpu_a @ gpu_b
torch.cuda.synchronize() # Wait for computation
gpu_time = time.time() - start
print(f"CPU: {cpu_time:.4f}s")
print(f"GPU: {gpu_time:.4f}s")
print(f"Speedup: {cpu_time/gpu_time:.1f}x")
Typical results: a 50-200x speedup for large matrix operations. Something like that, anyway.
Memory management tips:
# Check GPU memory usage
print(f"Allocated: {torch.cuda.memory_allocated()/1024**3:.2f} GB")
print(f"Cached: {torch.cuda.memory_reserved()/1024**3:.2f} GB")
# Clear cache when needed
torch.cuda.empty_cache()
# Delete tensors explicitly
del large_tensor
torch.cuda.empty_cache()
Basic Operations That You'll Actually Use
PyTorch documentation covers many operations. Here are the ones I use in every project.
Arithmetic operations:
a = torch.tensor([[1, 2], [3, 4]], dtype=torch.float32)
b = torch.tensor([[5, 6], [7, 8]], dtype=torch.float32)
# Element-wise operations
add = a + b # or torch.add(a, b)
subtract = a - b
multiply = a * b # element-wise, NOT matrix multiplication
divide = a / b
# In-place operations (save memory)
a.add_(1) # adds 1 to all elements, modifies a
a.mul_(2) # multiplies all by 2
Matrix operations:
# Matrix multiplication - three ways
result1 = a @ b # recommended
result2 = torch.matmul(a, b)
result3 = a.mm(b) # only for 2D tensors
# Batch matrix multiplication
batch_a = torch.randn(32, 10, 20)
batch_b = torch.randn(32, 20, 30)
batch_result = batch_a @ batch_b # Shape: [32, 10, 30]
# Transpose
transposed = a.T # or a.transpose(0, 1)
Reduction operations:
x = torch.tensor([[1, 2, 3], [4, 5, 6]], dtype=torch.float32)
# Sum
total_sum = x.sum() # 21
row_sum = x.sum(dim=1) # [6, 15]
col_sum = x.sum(dim=0) # [5, 7, 9]
# Mean
mean_all = x.mean() # 3.5
mean_rows = x.mean(dim=1) # [2, 5]
# Max/Min
max_value = x.max() # 6
max_per_row = x.max(dim=1) # returns (values, indices)
values, indices = x.max(dim=1)
Indexing and slicing:
tensor = torch.randn(3, 4, 5)
# Basic indexing
first_element = tensor[0] # Shape: [4, 5]
specific = tensor[1, 2, 3] # Single value
# Slicing
first_two = tensor[:2] # First 2 along dim 0
middle = tensor[:, 1:3, :] # Slice dim 1
# Advanced indexing
mask = tensor > 0
positive_values = tensor[mask] # 1D tensor of positive values
# Gather specific indices
indices = torch.tensor([0, 2])
selected = torch.index_select(tensor, dim=0, index=indices)
Common patterns from production:
# Clamp values to range
clamped = torch.clamp(tensor, min=0, max=1)
# Replace NaN values
tensor[torch.isnan(tensor)] = 0
# Concatenate tensors
concat_dim0 = torch.cat([tensor1, tensor2], dim=0)
stack = torch.stack([tensor1, tensor2]) # New dimension
# Split tensor
chunks = torch.chunk(tensor, chunks=3, dim=0)
split = torch.split(tensor, split_size_or_sections=2, dim=1)
Common Pitfalls and How to Avoid Them
After debugging hundreds of tensor errors, here are the patterns.
Pitfall 1: Gradient tracking when not needed
# Wrong - tracks gradients unnecessarily
data = torch.randn(100, 100, requires_grad=True)
processed = data * 2 + 1 # Still tracking gradients
# Right - no gradient tracking for data preprocessing
data = torch.randn(100, 100)
# Or explicitly disable
with torch.no_grad():
    processed = data * 2 + 1
Pitfall 2: In-place operations breaking gradients
# Wrong - in-place op on a leaf tensor breaks autograd
x = torch.randn(5, requires_grad=True)
# x += 1 # RuntimeError: a leaf Variable that requires grad is used in an in-place operation
# Right
x = torch.randn(5, requires_grad=True)
x = x + 1 # New tensor
Pitfall 3: Wrong broadcasting assumptions
# Surprise broadcasting
a = torch.randn(3, 1) # Shape: [3, 1]
b = torch.randn(1, 4) # Shape: [1, 4]
c = a + b # Shape: [3, 4] - might not be intended!
# Be explicit about shapes
a = torch.randn(3, 4)
b = torch.randn(3, 4)
c = a + b # Clear intention
Pitfall 4: Memory leaks with GPU tensors
# Wrong - accumulating GPU memory
losses = []
for batch in dataloader:
    loss = model(batch)
    losses.append(loss) # Keeps whole computation graph!
# Right
losses = []
for batch in dataloader:
    loss = model(batch)
    losses.append(loss.item()) # Just the number
Pitfall 5: Assuming contiguous memory
# After transpose, tensor might not be contiguous
x = torch.randn(3, 4)
x_t = x.transpose(0, 1)
print(x_t.is_contiguous()) # False
# Some operations require contiguous
x_t_contiguous = x_t.contiguous()
# Or just use reshape which handles it
x_reshaped = x.reshape(4, 3)
Real Project Examples
Let me show tensor operations from actual production code.
Image preprocessing pipeline:
def preprocess_batch(images, device='cuda'):
    """Images from dataloader to model-ready tensors"""
    # images shape: (batch, height, width, channels) - numpy
    # Convert to tensor and normalize to [0,1]
    tensor_images = torch.from_numpy(images).float() / 255.0
    # Rearrange to PyTorch format: (batch, channels, height, width)
    tensor_images = tensor_images.permute(0, 3, 1, 2)
    # Normalize with ImageNet stats
    mean = torch.tensor([0.485, 0.456, 0.406]).view(1, 3, 1, 1)
    std = torch.tensor([0.229, 0.224, 0.225]).view(1, 3, 1, 1)
    normalized = (tensor_images - mean) / std
    return normalized.to(device)
Custom loss with tensor operations:
import torch.nn.functional as F

def focal_loss(predictions, targets, gamma=2.0, alpha=0.25):
    """Focal loss for imbalanced classification"""
    # predictions: (batch, num_classes)
    # targets: (batch,) - class indices
    ce_loss = F.cross_entropy(predictions, targets, reduction='none')
    probs = torch.softmax(predictions, dim=1)
    # Get probability of the correct class
    batch_size = targets.shape[0]
    probs_correct = probs[torch.arange(batch_size), targets]
    # Apply focal term
    focal_weight = (1 - probs_correct) ** gamma
    loss = alpha * focal_weight * ce_loss
    return loss.mean()
Efficient batch processing:
def process_large_dataset(data, model, batch_size=32):
    """Process dataset too large for memory"""
    device = next(model.parameters()).device
    results = []
    # Process in chunks
    for i in range(0, len(data), batch_size):
        batch = data[i:i+batch_size]
        batch_tensor = torch.tensor(batch, device=device)
        with torch.no_grad():
            output = model(batch_tensor)
        results.append(output.cpu())  # Move back to save GPU memory
    # Concatenate all results
    return torch.cat(results, dim=0)
Tested in practice. These patterns handle 90% of tensor operations in production.
Debugging Tensor Code Effectively
Debugging is half the development time. Here's my workflow.
Step 1: Shape debugging utilities
class TensorDebugger:
    def __init__(self, verbose=True):
        self.verbose = verbose

    def check(self, tensor, name, expected_shape=None):
        if self.verbose:
            print(f"\n{name}:")
            print(f"  Shape: {tensor.shape}")
            print(f"  Device: {tensor.device}")
            print(f"  Dtype: {tensor.dtype}")
            print(f"  Min: {tensor.min().item():.4f}")
            print(f"  Max: {tensor.max().item():.4f}")
            print(f"  Mean: {tensor.mean().item():.4f}")
        if expected_shape:
            assert tensor.shape == expected_shape, \
                f"Expected {expected_shape}, got {tensor.shape}"
        return tensor
# Usage
debug = TensorDebugger()
x = torch.randn(32, 10)
x = debug.check(x, "input", expected_shape=(32, 10))
Step 2: Gradient flow checking
def check_gradients(model):
    """Check if gradients are flowing properly"""
    for name, param in model.named_parameters():
        if param.grad is not None:
            grad_norm = param.grad.norm().item()
            param_norm = param.norm().item()
            print(f"{name}: grad_norm={grad_norm:.4f}, param_norm={param_norm:.4f}")
            if grad_norm == 0:
                print("  WARNING: Zero gradient!")
            elif grad_norm > 100:
                print("  WARNING: Large gradient!")
Step 3: Memory profiling
def profile_memory(func):
    """Decorator to profile GPU memory usage"""
    def wrapper(*args, **kwargs):
        torch.cuda.reset_peak_memory_stats()
        torch.cuda.synchronize()
        start_memory = torch.cuda.memory_allocated()
        result = func(*args, **kwargs)
        torch.cuda.synchronize()
        end_memory = torch.cuda.memory_allocated()
        peak_memory = torch.cuda.max_memory_allocated()
        print(f"\nMemory profile for {func.__name__}:")
        print(f"  Used: {(end_memory - start_memory) / 1024**2:.2f} MB")
        print(f"  Peak: {peak_memory / 1024**2:.2f} MB")
        return result
    return wrapper
Common debugging commands:
# Enable anomaly detection (finds where NaN originated)
torch.autograd.set_detect_anomaly(True)
# Print tensor without scientific notation
torch.set_printoptions(precision=4, sci_mode=False)
# Check for NaN/Inf
has_nan = torch.isnan(tensor).any()
has_inf = torch.isinf(tensor).any()
# Find where error occurred
try:
    result = risky_operation(tensor)
except RuntimeError as e:
    print(f"Error: {e}")
    print(f"Input shape: {tensor.shape}")
    print(f"Input device: {tensor.device}")
    raise
Frequently Asked Questions
How to choose between .reshape() and .view()?
Use .view() when you need memory efficiency and know the tensor is contiguous. Use .reshape() when you want it to "just work" - it handles non-contiguous tensors automatically by copying if needed. In practice I default to .reshape() unless optimizing hot paths.
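A quick sketch of the difference: .view() refuses non-contiguous tensors, while .reshape() copies when it must:

```python
import torch

x = torch.randn(3, 4)
x_t = x.transpose(0, 1)  # non-contiguous view, shape [4, 3]

try:
    x_t.view(12)         # fails: view needs compatible contiguous memory
except RuntimeError as e:
    print("view failed:", e)

flat = x_t.reshape(12)   # works: reshape falls back to a copy
print(flat.shape)        # torch.Size([12])
```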
Why does my code work on CPU but crash on GPU?
Most common: tensors on different devices. Always check with .device attribute. Second: out of memory - GPU has limited RAM. Use smaller batches or gradient accumulation. Third: some operations not implemented for CUDA. Check PyTorch docs for specific function.
When to use torch.no_grad()?
Use torch.no_grad() for any code that doesn't need gradients: inference, data preprocessing, metric calculation. It saves memory and speeds up computation. Don't use it during the training forward pass, or your gradients will be None.
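You can see the effect directly on the requires_grad flag of the result:

```python
import torch

x = torch.randn(4, requires_grad=True)

y = x * 2            # tracked: part of the autograd graph
with torch.no_grad():
    z = x * 2        # not tracked: no graph, no memory for gradients

print(y.requires_grad)  # True
print(z.requires_grad)  # False
```

Anything computed inside the block is detached from autograd, which is exactly what you want for inference and metrics - and exactly what you don't want for the training forward pass.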
How to debug "RuntimeError: shape mismatch"?
Print shapes before the failing operation. Use my TensorDebugger class above. Add asserts for expected shapes throughout the code. Most shape errors happen at: batch dimension mismatches, forgotten flatten before linear layer, or wrong dimension in reduction operations.
Conclusion
PyTorch tensors aren't complicated once you understand shapes and devices. Start with CPU, move to GPU when you need speed. Always check shapes. Use the debugging utilities I showed.
The 20% of operations I covered handle 80% of use cases. Matrix multiplication, reshaping, and device management - master these first. The rest comes naturally.
Next step: build something. Even simple matrix operations on GPU will show you the speed difference. That "aha" moment when you see 100x speedup makes everything click.