PyTorch Tensors Visual Guide: From Zero to Neural Networks in 30 Minutes
By: Evgeny Padezhnov
PyTorch powers everything from Tesla's self-driving to ChatGPT. But everyone gets stuck on tensors first. I spent two weeks debugging shape errors when I started. Here's what I wish someone showed me on day one.
This guide shows tensor operations visually. You'll understand shapes, operations, and GPU acceleration. Real code examples from production projects included.
What Are Tensors, Really?
Tensors are just multi-dimensional arrays. Think of an Excel spreadsheet that can have more than two dimensions. PyTorch documentation calls them "the central data abstraction in PyTorch."
Here's the hierarchy:
- 0D tensor = scalar (single number)
- 1D tensor = vector (list of numbers)
- 2D tensor = matrix (table of numbers)
- 3D+ tensor = tensor (cube or hypercube of numbers)
import torch
# 0D - scalar
scalar = torch.tensor(42)
# 1D - vector (5 elements)
vector = torch.tensor([1, 2, 3, 4, 5])
# 2D - matrix (3x3)
matrix = torch.tensor([[1, 2, 3],
                       [4, 5, 6],
                       [7, 8, 9]])
# 3D - tensor (2x3x4)
tensor_3d = torch.randn(2, 3, 4)
I use this mental model: spreadsheet with layers. First dimension = number of sheets. Second = rows. Third = columns.
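The spreadsheet-with-layers model is easy to verify directly. A minimal sketch (the variable names are mine):

```python
import torch

# 2 sheets, 3 rows, 4 columns
t = torch.arange(24).reshape(2, 3, 4)

sheet = t[0]        # first "sheet": shape [3, 4]
row = t[0, 1]       # second row of the first sheet: shape [4]
cell = t[0, 1, 2]   # a single cell

print(sheet.shape)  # torch.Size([3, 4])
print(row.shape)    # torch.Size([4])
print(cell.item())  # 6
```

Each index you add peels off one dimension, which is why `t[0]` is a 2D table and `t[0, 1, 2]` is a scalar.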
PyTorch tensors beat NumPy arrays for three reasons. GitHub's Tensors-101 guide nails it:
- GPU acceleration - operations orders of magnitude faster on large workloads
- Automatic differentiation - gradients calculated automatically
- Deep learning ecosystem - built-in losses, optimizers, layers
In practice, I switched from NumPy to PyTorch even for non-neural tasks. The GPU speedup on matrix operations is insane.
Default data type is 32-bit float (torch.float32). Takes 4 bytes per number. For class labels use torch.int64 (8 bytes). For mixed precision training - torch.float16 (2 bytes).
# Default float32
x = torch.tensor([1.0, 2.0, 3.0])
print(x.dtype) # torch.float32
# Explicit types
integers = torch.tensor([1, 2, 3], dtype=torch.int64)
half_precision = torch.tensor([1.0, 2.0], dtype=torch.float16)
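You can verify those byte counts yourself with element_size(). A quick sketch:

```python
import torch

x32 = torch.tensor([1.0, 2.0, 3.0])                  # default float32
x64 = torch.tensor([1, 2, 3], dtype=torch.int64)     # class labels
x16 = torch.tensor([1.0, 2.0], dtype=torch.float16)  # mixed precision

for name, t in [("float32", x32), ("int64", x64), ("float16", x16)]:
    # element_size() is bytes per element; nelement() is the element count
    total = t.element_size() * t.nelement()
    print(f"{name}: {t.element_size()} bytes/element, {total} bytes total")
```

float32 prints 4 bytes per element, int64 prints 8, float16 prints 2 - matching the numbers above.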
Creating Tensors: Four Methods That Matter
PyTorch Essentials shows multiple initialization approaches. Here are the ones I actually use daily.
Method 1: From Python lists
# Simple list to tensor
data = [[1, 2, 3], [4, 5, 6]]
tensor = torch.tensor(data)
# From NumPy (super common)
import numpy as np
numpy_array = np.array([[1, 2], [3, 4]])
tensor_from_numpy = torch.from_numpy(numpy_array)
Method 2: Pre-filled tensors
# Zeros - for initialization
zeros = torch.zeros(3, 4) # 3x4 matrix of zeros
# Ones - for masks
ones = torch.ones(2, 3, 4) # 2x3x4 tensor of ones
# Random - for weights initialization
random = torch.rand(5, 5) # values between 0 and 1
Method 3: Like existing tensor
x = torch.tensor([[1, 2], [3, 4]])
# Same shape, different values
zeros_like_x = torch.zeros_like(x)
ones_like_x = torch.ones_like(x)
random_like_x = torch.rand_like(x)
Method 4: Special initializations
# Identity matrix
eye = torch.eye(3) # 3x3 identity
# Range
range_tensor = torch.arange(0, 10, 2) # [0, 2, 4, 6, 8]
# Linspace
linear = torch.linspace(0, 1, 5) # 5 points from 0 to 1
torch.empty() creates uninitialized tensors. Looks random but it's just garbage from memory. Never use for actual calculations - debugging nightmare.
# DON'T DO THIS
bad = torch.empty(2, 3) # Contains random memory garbage
# DO THIS
good = torch.zeros(2, 3) # Properly initialized
Pro tip from debugging sessions: always print tensor shapes after creation. Saves hours of "RuntimeError: shape mismatch" hunting.
Tensor Shapes: The #1 Source of Bugs
"Shape is the single most important property of a tensor. Almost every bug in deep learning is, at its root, a shape mismatch." - straight from Tensors-101 repository. Hundred percent true.
Standard conventions by data type:
- Tabular data: (batch, features)
- Images: (batch, channels, height, width)
- Text/NLP: (batch, sequence_length, embedding_dim)
# Check shape
tensor = torch.randn(64, 3, 224, 224) # Batch of 64 RGB images
print(tensor.shape) # torch.Size([64, 3, 224, 224])
print(tensor.size()) # Same thing
# Individual dimensions
print(tensor.shape[0]) # 64 (batch size)
print(tensor.shape[1]) # 3 (channels)
Reshaping operations I use every project:
# Original tensor
x = torch.randn(4, 3, 2)
# Reshape (total elements must match)
reshaped = x.reshape(6, 4) # 4*3*2 = 24 = 6*4
# View (like reshape, but requires contiguous memory and never copies)
viewed = x.view(12, 2)
# Flatten for fully connected layers
flattened = x.flatten() # Shape: [24]
batch_flattened = x.flatten(start_dim=1) # Keep batch: [4, 6]
# Squeeze removes dimensions of size 1
y = torch.randn(1, 3, 1, 5)
squeezed = y.squeeze() # Shape: [3, 5]
# Unsqueeze adds dimension of size 1
z = torch.randn(3, 5)
unsqueezed = z.unsqueeze(0) # Shape: [1, 3, 5]
unsqueezed = z.unsqueeze(-1) # Shape: [3, 5, 1]
Common shape debugging pattern:
def debug_shapes(name, tensor):
    print(f"{name}: shape={tensor.shape}, dtype={tensor.dtype}, device={tensor.device}")
# Use everywhere during development
x = torch.randn(32, 10)
debug_shapes("input", x)
model_output = model(x)
debug_shapes("output", model_output)
Shape mismatches happen most often here:
- Batch size mismatch between layers
- Forgetting to flatten before fully connected
- Wrong channel order (channels-first vs channels-last)
- Broadcasting rules confusion
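The forgotten-flatten case is worth seeing once. A minimal sketch (the layer sizes here are made up for illustration):

```python
import torch
import torch.nn as nn

images = torch.randn(8, 3, 32, 32)        # batch of 8 small RGB images
fc = nn.Linear(3 * 32 * 32, 10)           # expects a [batch, 3072] input

# The classic failure - Linear operates on the last dimension (32 != 3072):
# fc(images)  # RuntimeError: mat1 and mat2 shapes cannot be multiplied

logits = fc(images.flatten(start_dim=1))  # Shape: [8, 10]
print(logits.shape)
```

Flattening everything after the batch dimension turns [8, 3, 32, 32] into [8, 3072], which is what the linear layer expects.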
My debugging checklist:
- Print shapes before and after each operation
- Use assert for expected shapes in critical places
- Keep batch dimension consistent (always first)
- Document expected shapes in comments
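The assert item from that checklist can look like this in practice (the helper name is mine, not a PyTorch API):

```python
import torch

def assert_shape(tensor, expected, name="tensor"):
    """Fail fast with a readable message when shapes drift."""
    assert tuple(tensor.shape) == tuple(expected), \
        f"{name}: expected shape {tuple(expected)}, got {tuple(tensor.shape)}"

x = torch.randn(32, 3, 224, 224)
assert_shape(x, (32, 3, 224, 224), "batch")           # passes silently

flat = x.flatten(start_dim=1)
assert_shape(flat, (32, 3 * 224 * 224), "flattened")  # passes
```

A wrong expectation fails immediately with both shapes in the message, instead of surfacing three layers later as a cryptic matmul error.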
GPU Operations: Speed That Changes Everything
Codegenes guide mentions device flexibility. Here's what actually matters for speed.
First, check GPU availability:
# Is CUDA available?
print(torch.cuda.is_available()) # True/False
# How many GPUs?
print(torch.cuda.device_count())
# Current GPU name
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))
Moving tensors between devices:
# Create on CPU
cpu_tensor = torch.randn(1000, 1000)
# Move to GPU (if available)
if torch.cuda.is_available():
    gpu_tensor = cpu_tensor.cuda()
    # Or more explicit
    gpu_tensor = cpu_tensor.to('cuda')
    # Or to specific GPU
    gpu_tensor = cpu_tensor.to('cuda:0')
    # Back to CPU
    back_to_cpu = gpu_tensor.cpu()
Better pattern - device agnostic code:
# Set device once
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f"Using device: {device}")
# Create directly on correct device
x = torch.randn(100, 100, device=device)
y = torch.randn(100, 100, device=device)
# Operations stay on same device
z = x @ y # Matrix multiplication on GPU
Golden rule from experience: both tensors must be on same device. This crashes:
# DON'T DO THIS
cpu_tensor = torch.randn(10, 10)
gpu_tensor = torch.randn(10, 10).cuda()
# result = cpu_tensor + gpu_tensor # RuntimeError!
# DO THIS
result = cpu_tensor.cuda() + gpu_tensor
Speed comparison on my RTX 3090:
import time
size = 5000
cpu_a = torch.randn(size, size)
cpu_b = torch.randn(size, size)
# CPU timing
start = time.time()
cpu_result = cpu_a @ cpu_b
cpu_time = time.time() - start
# GPU timing
gpu_a = cpu_a.cuda()
gpu_b = cpu_b.cuda()
torch.cuda.synchronize() # Wait for transfer
start = time.time()
gpu_result = gpu_a @ gpu_b
torch.cuda.synchronize() # Wait for computation
gpu_time = time.time() - start
print(f"CPU: {cpu_time:.4f}s")
print(f"GPU: {gpu_time:.4f}s")
print(f"Speedup: {cpu_time/gpu_time:.1f}x")
Typical results: a 50-200x speedup for large matrix operations. Something like that, anyway.
Memory management tips:
# Check GPU memory usage
print(f"Allocated: {torch.cuda.memory_allocated()/1024**3:.2f} GB")
print(f"Cached: {torch.cuda.memory_reserved()/1024**3:.2f} GB")
# Clear cache when needed
torch.cuda.empty_cache()
# Delete tensors explicitly
del large_tensor
torch.cuda.empty_cache()
Basic Operations That You'll Actually Use
PyTorch documentation covers many operations. Here are the ones I use in every project.
Arithmetic operations:
a = torch.tensor([[1, 2], [3, 4]], dtype=torch.float32)
b = torch.tensor([[5, 6], [7, 8]], dtype=torch.float32)
# Element-wise operations
add = a + b # or torch.add(a, b)
subtract = a - b
multiply = a * b # element-wise, NOT matrix multiplication
divide = a / b
# In-place operations (save memory)
a.add_(1) # adds 1 to all elements, modifies a
a.mul_(2) # multiplies all by 2
Matrix operations:
# Matrix multiplication - three ways
result1 = a @ b # recommended
result2 = torch.matmul(a, b)
result3 = a.mm(b) # only for 2D tensors
# Batch matrix multiplication
batch_a = torch.randn(32, 10, 20)
batch_b = torch.randn(32, 20, 30)
batch_result = batch_a @ batch_b # Shape: [32, 10, 30]
# Transpose
transposed = a.T # or a.transpose(0, 1)
Reduction operations:
x = torch.tensor([[1, 2, 3], [4, 5, 6]], dtype=torch.float32)
# Sum
total_sum = x.sum() # 21
row_sum = x.sum(dim=1) # [6, 15]
col_sum = x.sum(dim=0) # [5, 7, 9]
# Mean
mean_all = x.mean() # 3.5
mean_rows = x.mean(dim=1) # [2, 5]
# Max/Min
max_value = x.max() # 6
max_per_row = x.max(dim=1) # returns (values, indices)
values, indices = x.max(dim=1)
Indexing and slicing:
tensor = torch.randn(3, 4, 5)
# Basic indexing
first_element = tensor[0] # Shape: [4, 5]
specific = tensor[1, 2, 3] # Single value
# Slicing
first_two = tensor[:2] # First 2 along dim 0
middle = tensor[:, 1:3, :] # Slice dim 1
# Advanced indexing
mask = tensor > 0
positive_values = tensor[mask] # 1D tensor of positive values
# Gather specific indices
indices = torch.tensor([0, 2])
selected = torch.index_select(tensor, dim=0, index=indices)
Common patterns from production:
# Clamp values to range
clamped = torch.clamp(tensor, min=0, max=1)
# Replace NaN values
tensor[torch.isnan(tensor)] = 0
# Concatenate tensors
concat_dim0 = torch.cat([tensor1, tensor2], dim=0)
stack = torch.stack([tensor1, tensor2]) # New dimension
# Split tensor
chunks = torch.chunk(tensor, chunks=3, dim=0)
split = torch.split(tensor, split_size_or_sections=2, dim=1)
Common Pitfalls and How to Avoid Them
After debugging hundreds of tensor errors, here are the patterns.
Pitfall 1: Gradient tracking when not needed
# Wrong - tracks gradients unnecessarily
data = torch.randn(100, 100, requires_grad=True)
processed = data * 2 + 1 # Still tracking gradients
# Right - no gradient tracking for data preprocessing
data = torch.randn(100, 100)
# Or explicitly disable
with torch.no_grad():
    processed = data * 2 + 1
Pitfall 2: In-place operations breaking gradients
# Wrong - in-place op on a leaf tensor breaks autograd
x = torch.randn(5, requires_grad=True)
# x += 1 # RuntimeError: a leaf Variable that requires grad is used in an in-place operation
# Right
x = torch.randn(5, requires_grad=True)
x = x + 1 # New tensor
Pitfall 3: Wrong broadcasting assumptions
# Surprise broadcasting
a = torch.randn(3, 1) # Shape: [3, 1]
b = torch.randn(1, 4) # Shape: [1, 4]
c = a + b # Shape: [3, 4] - might not be intended!
# Be explicit about shapes
a = torch.randn(3, 4)
b = torch.randn(3, 4)
c = a + b # Clear intention
Pitfall 4: Memory leaks with GPU tensors
# Wrong - accumulating GPU memory
losses = []
for batch in dataloader:
    loss = model(batch)
    losses.append(loss) # Keeps whole computation graph!
# Right
losses = []
for batch in dataloader:
    loss = model(batch)
    losses.append(loss.item()) # Just the number
Pitfall 5: Assuming contiguous memory
# After transpose, tensor might not be contiguous
x = torch.randn(3, 4)
x_t = x.transpose(0, 1)
print(x_t.is_contiguous()) # False
# Some operations require contiguous
x_t_contiguous = x_t.contiguous()
# Or just use reshape which handles it
x_reshaped = x.reshape(4, 3)
Real Project Examples
Let me show tensor operations from actual production code.
Image preprocessing pipeline:
def preprocess_batch(images, device='cuda'):
    """Images from dataloader to model-ready tensors"""
    # images shape: (batch, height, width, channels) - numpy
    # Convert to tensor and normalize to [0,1]
    tensor_images = torch.from_numpy(images).float() / 255.0
    # Rearrange to PyTorch format: (batch, channels, height, width)
    tensor_images = tensor_images.permute(0, 3, 1, 2)
    # Normalize with ImageNet stats
    mean = torch.tensor([0.485, 0.456, 0.406]).view(1, 3, 1, 1)
    std = torch.tensor([0.229, 0.224, 0.225]).view(1, 3, 1, 1)
    normalized = (tensor_images - mean) / std
    return normalized.to(device)
Custom loss with tensor operations:
import torch.nn.functional as F

def focal_loss(predictions, targets, gamma=2.0, alpha=0.25):
    """Focal loss for imbalanced classification"""
    # predictions: (batch, num_classes)
    # targets: (batch,) - class indices
    ce_loss = F.cross_entropy(predictions, targets, reduction='none')
    probs = torch.softmax(predictions, dim=1)
    # Get probability of the correct class
    batch_size = targets.shape[0]
    probs_correct = probs[torch.arange(batch_size), targets]
    # Apply focal term
    focal_weight = (1 - probs_correct) ** gamma
    loss = alpha * focal_weight * ce_loss
    return loss.mean()
Efficient batch processing:
def process_large_dataset(data, model, batch_size=32):
    """Process dataset too large for memory"""
    device = next(model.parameters()).device
    results = []
    # Process in chunks
    for i in range(0, len(data), batch_size):
        batch = data[i:i+batch_size]
        batch_tensor = torch.tensor(batch, device=device)
        with torch.no_grad():
            output = model(batch_tensor)
        results.append(output.cpu())  # Move back to save GPU memory
    # Concatenate all results
    return torch.cat(results, dim=0)
Tested in practice. These patterns handle 90% of tensor operations in production.
Debugging Tensor Code Effectively
Debugging is half the development time. Here's my workflow.
Step 1: Shape debugging utilities
class TensorDebugger:
    def __init__(self, verbose=True):
        self.verbose = verbose

    def check(self, tensor, name, expected_shape=None):
        if self.verbose:
            print(f"\n{name}:")
            print(f"  Shape: {tensor.shape}")
            print(f"  Device: {tensor.device}")
            print(f"  Dtype: {tensor.dtype}")
            print(f"  Min: {tensor.min().item():.4f}")
            print(f"  Max: {tensor.max().item():.4f}")
            print(f"  Mean: {tensor.mean().item():.4f}")
        if expected_shape:
            assert tensor.shape == expected_shape, \
                f"Expected {expected_shape}, got {tensor.shape}"
        return tensor
# Usage
debug = TensorDebugger()
x = torch.randn(32, 10)
x = debug.check(x, "input", expected_shape=(32, 10))
Step 2: Gradient flow checking
def check_gradients(model):
    """Check if gradients are flowing properly"""
    for name, param in model.named_parameters():
        if param.grad is not None:
            grad_norm = param.grad.norm().item()
            param_norm = param.norm().item()
            print(f"{name}: grad_norm={grad_norm:.4f}, param_norm={param_norm:.4f}")
            if grad_norm == 0:
                print("  WARNING: Zero gradient!")
            elif grad_norm > 100:
                print("  WARNING: Large gradient!")
Step 3: Memory profiling
def profile_memory(func):
    """Decorator to profile GPU memory usage"""
    def wrapper(*args, **kwargs):
        torch.cuda.reset_peak_memory_stats()
        torch.cuda.synchronize()
        start_memory = torch.cuda.memory_allocated()
        result = func(*args, **kwargs)
        torch.cuda.synchronize()
        end_memory = torch.cuda.memory_allocated()
        peak_memory = torch.cuda.max_memory_allocated()
        print(f"\nMemory profile for {func.__name__}:")
        print(f"  Used: {(end_memory - start_memory) / 1024**2:.2f} MB")
        print(f"  Peak: {peak_memory / 1024**2:.2f} MB")
        return result
    return wrapper
Common debugging commands:
# Enable anomaly detection (finds where NaN originated)
torch.autograd.set_detect_anomaly(True)
# Print tensor without scientific notation
torch.set_printoptions(precision=4, sci_mode=False)
# Check for NaN/Inf
has_nan = torch.isnan(tensor).any()
has_inf = torch.isinf(tensor).any()
# Find where error occurred
try:
    result = risky_operation(tensor)
except RuntimeError as e:
    print(f"Error: {e}")
    print(f"Input shape: {tensor.shape}")
    print(f"Input device: {tensor.device}")
    raise
Frequently Asked Questions
How to choose between .reshape() and .view()?
Use .view() when you need memory efficiency and know the tensor is contiguous. Use .reshape() when you want it to "just work" - it handles non-contiguous tensors automatically by copying if needed. In practice I default to .reshape() unless optimizing hot paths.
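A quick sketch of the difference: .view() refuses non-contiguous tensors, while .reshape() copies when it must:

```python
import torch

x = torch.randn(3, 4)
x_t = x.transpose(0, 1)  # non-contiguous view, shape [4, 3]

try:
    x_t.view(12)         # fails: view needs compatible contiguous memory
except RuntimeError as e:
    print("view failed:", e)

flat = x_t.reshape(12)   # works: reshape falls back to a copy
print(flat.shape)        # torch.Size([12])
```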
Why does my code work on CPU but crash on GPU?
Most common: tensors on different devices. Always check with .device attribute. Second: out of memory - GPU has limited RAM. Use smaller batches or gradient accumulation. Third: some operations not implemented for CUDA. Check PyTorch docs for specific function.
When to use torch.no_grad()?
Use torch.no_grad() for any code that doesn't need gradients: inference, data preprocessing, metric calculation. It saves memory and speeds up computation. Don't use it during the training forward pass, or your gradients will be None.
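You can see the effect directly on the requires_grad flag of the result:

```python
import torch

x = torch.randn(4, requires_grad=True)

y = x * 2            # tracked: part of the autograd graph
with torch.no_grad():
    z = x * 2        # not tracked: no graph, no memory for gradients

print(y.requires_grad)  # True
print(z.requires_grad)  # False
```

Anything computed inside the block is detached from autograd, which is exactly what you want for inference and metrics - and exactly what you don't want for the training forward pass.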
How to debug "RuntimeError: shape mismatch"?
Print shapes before the failing operation. Use my TensorDebugger class above. Add asserts for expected shapes throughout the code. Most shape errors happen at: batch dimension mismatches, forgotten flatten before linear layer, or wrong dimension in reduction operations.
Conclusion
PyTorch tensors aren't complicated once you understand shapes and devices. Start with CPU, move to GPU when you need speed. Always check shapes. Use the debugging utilities I showed.
The 20% of operations I covered handle 80% of use cases. Matrix multiplication, reshaping, and device management - master these first. The rest comes naturally.
Next step: build something. Even simple matrix operations on GPU will show you the speed difference. That "aha" moment when you see 100x speedup makes everything click.