Command documentation sourced from the linux-command project.

split - Split a File into Pieces

The split command splits large files into smaller, more manageable pieces based on size, line count, or a fixed number of chunks. It's an essential tool for file management, data processing, backup operations, and overcoming file transfer limitations. Split supports several splitting modes and flexible naming conventions, and it integrates with other Unix tools through pipelines. It's particularly useful for handling large log files, database dumps, backup archives, and datasets that exceed size limits for storage media or network transfers.

Basic Syntax

split [OPTION]... [INPUT [PREFIX]]

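With no options, split copies 1000 lines at a time into output files named xaa, xab, xac, and so on; an optional PREFIX replaces the default x:

# Default behavior: 1000 lines per output file, named xaa, xab, ...
split large_file.txt

# Same, but with a custom prefix: piece_aa, piece_ab, ...
split large_file.txt piece_
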
Options Overview

Size-based Splitting Options

  • -b, --bytes=SIZE - Put SIZE bytes per output file (only the last file may be smaller)
  • -C, --line-bytes=SIZE - Put at most SIZE bytes of complete lines per output file (doesn't split lines)
  • --filter=COMMAND - Write each output chunk to shell COMMAND; the chunk's file name is available as $FILE (see the example after this list)

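--filter is handy for post-processing each chunk as it is written, for example compressing pieces on the fly; split exports the chunk's file name to the command in the $FILE environment variable:

# Compress each 100MB chunk as it is produced
split -b 100M --filter='gzip > $FILE.gz' large_file.txt part_
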
Line-based Splitting Options

  • -l, --lines=NUMBER - Put NUMBER lines/records per output file (default 1000)

Record Separator Options

  • -t, --separator=SEP - Use SEP instead of newline as the record separator
  • -e, --elide-empty-files - Do not generate empty output files with -n (see the example after this list)

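A quick illustration of -e: when -n asks for more chunks than there is data, the surplus output files would be empty, and -e suppresses them:

# A 5-line file split into 10 line-based chunks yields 5 empty files; -e drops them
split -e -n l/10 five_line_file.txt part_
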
Naming and Output Options

  • -d, --numeric-suffixes[=FROM] - Use numeric suffixes instead of alphabetic, optionally starting at FROM
  • -a, --suffix-length=N - Use suffixes of length N (default 2)
  • --additional-suffix=SUFFIX - Append an additional SUFFIX to file names

Advanced Splitting Options

  • -n, --number=CHUNKS - Split into CHUNKS output files of roughly equal size
  • -n, --number=K/N - Output only the Kth of N chunks to standard output
  • -n, --number=l/N or l/K/N - Split into N chunks without splitting lines/records
  • -x, --hex-suffixes[=FROM] - Use hexadecimal suffixes instead of alphabetic

Information and Help

  • --help - Display help message and exit
  • --version - Output version information and exit

Size Specifications

SIZE can have multiplier suffixes for convenience (a short comparison follows the lists below):

Standard Multipliers

  • b - 512-byte blocks
  • KB - 1000 bytes (kilobyte)
  • K - 1024 bytes (kibibyte)
  • MB - 1,000,000 bytes (megabyte)
  • M - 1,048,576 bytes (mebibyte)
  • GB - 1,000,000,000 bytes (gigabyte)
  • G - 1,073,741,824 bytes (gibibyte)

Large File Support

  • TB - 1,000,000,000,000 bytes (terabyte)
  • T - 1,099,511,627,776 bytes (tebibyte)
  • PB - 1,000,000,000,000,000 bytes (petabyte)
  • P - 1,125,899,906,842,624 bytes (pebibyte)

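The single-letter suffixes are binary multiples (powers of 1024) while the two-letter forms ending in B are decimal (powers of 1000), so the resulting pieces differ slightly in size:

# 10M = 10 × 1,048,576 bytes per piece
split -b 10M data.bin bin_part_

# 10MB = 10 × 1,000,000 bytes per piece
split -b 10MB data.bin dec_part_
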
Usage Examples

Basic File Splitting

Size-based Splitting

# Split file into 1MB pieces
split -b 1M large_file.txt

# Split into 100KB pieces with custom prefix
split -b 100K database.sql db_backup_

# Split into exact byte sizes
split -b 1048576 video.mp4 video_chunk_

# Split into 1.5MB pieces (sizes must be whole numbers, so express 1.5M as 1536K)
split -b 1536K document.pdf doc_part_

Line-based Splitting

# Split into 1000 line chunks
split -l 1000 access_log.txt log_part_

# Split CSV into manageable chunks
split -l 50000 large_dataset.csv csv_batch_

# Split with custom prefix and line count
split -l 10000 user_list.txt users_batch_

# Split configuration file by sections
awk '/^# Section/ {section++} {print > "section_" section ".txt"}' config.conf

Advanced Splitting Patterns

# Split with line preservation (never cuts lines)
split -C 1M text_file.txt text_chunk_

# Split into exactly 5 files of equal size
split -n 5 large_file.txt chunk_

# Split 3rd part of 10 total to stdout
split -n 3/10 large_file.txt > third_part.txt

# Split with custom separator
split -t ',' -l 1000 csv_file.csv csv_part_

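The l/ forms of -n divide a file into a fixed number of pieces without breaking lines, which is usually what you want for text or CSV data:

# Split into 5 roughly equal pieces without splitting any line
split -n l/5 large_file.txt chunk_

# Send the 3rd of 10 line-aligned pieces to standard output
split -n l/3/10 large_file.txt > third_part.txt
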
File Naming and Organization

Custom Naming Schemes

# Use numeric suffixes
split -d -b 1M data.dat part_

# Use longer suffixes for many files
split -d -a 5 -b 10M huge_file.bin part_

# Use hexadecimal suffixes
split -x -b 1M binary_data.bin hex_part_

# Start numbering from specific value
split -d --numeric-suffixes=100 -b 1M file.txt part_

# Add an additional suffix to each piece
split -b 100K --additional-suffix=.txt log.txt backup_

File Organization Patterns

# Split with date-based prefixes
split -b 500M backup.tar.gz "backup_$(date +%Y%m%d)_"

# Split with descriptive names
split -l 1000 contacts.csv contacts_batch_

# Split for CD/DVD burning (sizes must be whole numbers; 4.7GB ≈ 4700MB)
split -b 700M large_archive.iso cd_image_
split -b 4700M video_project.mkv dvd_part_

Practical Examples

System Administration

Log File Management

# Split a large access log into 50,000-line chunks
split -l 50000 access_log.txt access_part_

# Split large error logs by size
split -b 50M application.log error_part_

# Split logs for archival with dates
split -b 100M /var/log/syslog "syslog_$(date +%Y%m%d)_"

# Rotate and split logs in one operation
cp /var/log/app.log /var/log/app.log.$(date +%Y%m%d)
split -b 10M /var/log/app.log.$(date +%Y%m%d) "app_log_$(date +%Y%m%d)_"
> /var/log/app.log

Backup Operations

# Split large backup into manageable pieces
tar -czf backup.tar.gz /home/user/documents
split -b 100M backup.tar.gz backup_part_

# Create DVD-sized backup pieces (about 4.7GB each)
split -b 4700M system_backup.iso dvd_image_

# Split database dump for transfer
mysqldump -u user -p database > db_dump.sql
split -l 100000 db_dump.sql db_part_

# Split user home directory for migration
tar -cf - /home/user | split -b 1G - user_backup_

System Maintenance

# Split package files for distribution
split -b 650M software.tar.gz cd_image_

# Split mail spools for processing
split -l 50000 /var/mail/username mail_part_

# Split configuration files into sections for analysis (pattern splitting needs csplit)
csplit --prefix=nginx_conf_ /etc/nginx/nginx.conf '/^# Section/' '{*}'

# Split binary logs for database recovery
split -b 100M mysql-bin.log binlog_part_

Data Processing and Analysis

Large Dataset Processing

# Split CSV for parallel processing
split -l 10000 large_dataset.csv csv_part_

# Process split files in parallel
split -l 100000 huge_file.txt chunk_
for chunk in chunk_*; do
process_chunk "$chunk" &
done
wait

# Split XML files at record boundaries (pattern splitting needs csplit)
csplit --prefix=data_part_ large_data.xml '/<\/record>/' '{*}'

# Split JSON arrays for processing
jq -c '.[]' large_array.json | split -l 10000 - json_part_

Data Migration

# Split database exports for import
mysqldump database | split -l 100000 - db_export_

# Split large text files for FTP transfer
split -b 2GB huge_dataset.txt transfer_part_

# Split by content patterns
awk '/^--- NEW RECORD ---/ {if (filename) close(filename); filename="record_" ++count ".txt"} {print > filename}' data.txt

# Split configuration for deployment (keep lines intact, at most 1024 bytes per piece)
split -C 1024 production.conf deploy_part_

Network and File Transfer

Email and Attachment Splitting

# Split for email attachment limits
split -b 20M large_file.zip email_part_

# Create multipart email attachments
split -b 10M document.pdf doc_part_
for file in doc_part_*; do
echo "Attaching $file to email"
# mail -s "Attachment $file" recipient@example.com < "$file"
done

# Split for cloud storage limits
split -b 100M large_dataset.txt cloud_part_

Download and Transfer Optimization

# Create resume-friendly downloads
split -b 10MB download.zip part_

# Generate a download script for the split files
split -b 50M large_file.iso download_part_
for file in download_part_*; do
echo "wget http://example.com/$file"
done > download_commands.sh
chmod +x download_commands.sh

# Split for parallel uploads
split -b 100M data.tar.gz upload_part_
for part in upload_part_*; do
upload_to_cloud "$part" &
done
wait

Advanced Usage

Custom Splitting Logic

Pattern-based Splitting

# Split by content patterns using awk
awk '/^ERROR/ {print > "error.log"} !/ERROR/ {print > "normal.log"}' app.log

# Split by date ranges
awk '$1 >= "2023-01-01" && $1 < "2023-01-02"' all_logs.txt | split -l 10000 - jan1_logs_

# Split by file size patterns
find . -name "*.log" -size +100M -exec split -b 50M {} {}_part_ \;

# Split by a custom record separator (the separator must be a single character)
split -t ';' -l 1000 config_sections.txt section_

Pipeline Integration

# Split and compress each piece simultaneously
split -b 100M --filter='gzip > $FILE.gz' large_file.txt part_

# Split from network stream
curl http://example.com/largefile.zip | split -b 1M - download_part_

# Split output from database query
mysql -u user -p -e "SELECT * FROM large_table" database | split -l 50000 - query_result_

# Split, then process each chunk
split -l 1000 huge_file.txt chunk_
for chunk in chunk_*; do
process_chunk "$chunk"
done

Quality Control and Verification

Integrity Checking

# Verify split integrity by line count
wc -l original_file.txt
cat part_* | wc -l

# Check byte accuracy
wc -c original_file.txt
cat part_* | wc -c

# Generate checksums for split files
split -b 1GB backup.tar.gz part_
for file in part_*; do
md5sum "$file" > "${file}.md5"
done

# Verify checksums after transfer
for file in part_*; do
md5sum -c "${file}.md5"
done

Validation Scripts

#!/bin/bash
# Validate split file integrity
ORIGINAL_FILE="original.txt"
CHUNK_PREFIX="chunk_"

# Get original file info
ORIGINAL_SIZE=$(stat -f%z "$ORIGINAL_FILE" 2>/dev/null || stat -c%s "$ORIGINAL_FILE")
ORIGINAL_LINES=$(wc -l < "$ORIGINAL_FILE")

# Check reassembled file
cat ${CHUNK_PREFIX}* > reassembled.txt
REASSEMBLED_SIZE=$(stat -f%z reassembled.txt 2>/dev/null || stat -c%s reassembled.txt)
REASSEMBLED_LINES=$(wc -l < reassembled.txt)

# Compare results
echo "Original size: $ORIGINAL_SIZE, Reassembled size: $REASSEMBLED_SIZE"
echo "Original lines: $ORIGINAL_LINES, Reassembled lines: $REASSEMBLED_LINES"

if [ "$ORIGINAL_SIZE" -eq "$REASSEMBLED_SIZE" ] && [ "$ORIGINAL_LINES" -eq "$REASSEMBLED_LINES" ]; then
echo "✅ Split verification successful"
else
echo "❌ Split verification failed"
fi

Special Use Cases

Media File Processing

Video and Audio Splitting

# Split video files for transfer (byte-split pieces are not independently playable)
split -b 100M large_video.mp4 video_chunk_

# Split audio files by chapters (requires metadata parsing)
ffmpeg -i audiobook.mp3 -f segment -segment_time 300 -c copy chapter_%03d.mp3

# Split subtitle files
split -l 100 subtitles.srt subtitle_part_

# Split playlists
split -l 5000 music_playlist.m3u playlist_part_

Image and Document Processing

# Split large PDF files (requires pdftk)
pdftk large.pdf cat output - | split -b 1M - pdf_chunk_

# Split image archives
tar -cf - photos/ | split -b 500M - photo_archive_

# Split document collections by type
find . -name "*.pdf" -exec split -b 10M {} {}_part_ \;

Database Operations

Database Export Splitting

# Split MySQL export by table
for table in $(mysql -N -u user -p -e "SHOW TABLES" database); do
mysqldump -u user -p database "$table" | split -l 100000 - "${table}_dump_"
done

# Split PostgreSQL dump
pg_dump database | split -b 100M - pg_dump_part_

# Split MongoDB export
mongoexport --db database --collection data | split -l 50000 - mongo_export_

# Split Redis data backup
redis-cli --rdb backup.rdb
split -b 50M backup.rdb redis_backup_part_

Development and Testing

Test Data Generation

# Split test datasets for distributed testing
split -l 1000 test_data.txt test_suite_

# Split configuration files for different environments (pattern splitting needs csplit)
csplit --prefix=env_config_ config.json '/environment:/' '{*}'

# Split source code for analysis
find . -name "*.py" -exec cat {} \; | split -l 5000 - python_code_

# Split API responses for processing
curl -s "https://api.example.com/large-data" | split -l 1000 - api_data_

Performance Optimization

Large File Handling

Memory-efficient Splitting

# Split with minimal memory usage
split -b 1G huge_file.txt part_

# Lower I/O priority for very large splits on a busy system
ionice -c 3 split -b 1G large_file.txt part_

# Split streaming data without loading into memory
curl -s "http://example.com/largefile" | split -b 10M - stream_part_

Parallel Processing

# Split and process multiple files in parallel
for file in *.log; do
split -b 50M "$file" "${file}_" &
done
wait

# Parallel processing of split chunks
split -l 100000 huge_file.txt chunk_
for chunk in chunk_*; do
process_chunk "$chunk" &
done
wait

# Distributed processing with numeric suffixes (00-09)
split -d -n 10 large_file.txt distributed_part_
for i in {00..09}; do
process_chunk_on_node "$i" "distributed_part_$i" &
done
wait

I/O Optimization

Storage Optimization

# Write halves of a file to different disks for performance
split -n 1/2 /source/huge_file > /target/disk1/part_1
split -n 2/2 /source/huge_file > /target/disk2/part_2

# Write output pieces to fast storage for large operations
split -b 1G huge_file.txt /fast/temp/part_

# Split with compression for space efficiency
split -b 100M --filter='gzip > $FILE.gz' large_file.txt part_

Network Optimization

# Split for network bandwidth limits
split -b 10M file_for_transfer.txt net_part_

# Split and transfer in parallel
split -n 5 large_file.txt transfer_part_
for part in transfer_part_*; do
scp "$part" remote:/target/ &
done
wait

# Split for CDN distribution
split -b 50M website_assets.zip cdn_asset_

Automation and Scripting

Backup Script with Splitting

#!/bin/bash
# Advanced backup script with splitting and verification

BACKUP_DIR="/backups"
SOURCE_DIR="/important/data"
DATE=$(date +%Y%m%d_%H%M%S)
CHUNK_SIZE="1G"
PREFIX="backup_${DATE}"

# Create backup
echo "Creating backup..."
tar -czf "${BACKUP_DIR}/${PREFIX}.tar.gz" "$SOURCE_DIR"

# Split backup
echo "Splitting backup into ${CHUNK_SIZE} chunks..."
split -b "$CHUNK_SIZE" "${BACKUP_DIR}/${PREFIX}.tar.gz" "${BACKUP_DIR}/${PREFIX}_part_"

# Generate checksums
echo "Generating checksums..."
for chunk in "${BACKUP_DIR}/${PREFIX}_part_"*; do
sha256sum "$chunk" > "${chunk}.sha256"
done

# Create reassembly script
cat > "${BACKUP_DIR}/reassemble_${PREFIX}.sh" << EOF
#!/bin/bash
echo "Reassembling backup ${PREFIX}..."
cat ${PREFIX}_part_* > ${PREFIX}.tar.gz

echo "Verifying checksums..."
for chunk in ${PREFIX}_part_*; do
if ! sha256sum -c "\${chunk}.sha256"; then
echo "❌ Checksum verification failed for \$chunk"
exit 1
fi
done

echo "✅ Backup reassembled and verified successfully"
EOF

chmod +x "${BACKUP_DIR}/reassemble_${PREFIX}.sh"

# Clean up original large file
rm "${BACKUP_DIR}/${PREFIX}.tar.gz"

echo "Backup completed. Use ${BACKUP_DIR}/reassemble_${PREFIX}.sh to reassemble."

Log Rotation with Splitting

#!/bin/bash
# Advanced log rotation with intelligent splitting

LOG_DIR="/var/log"
ARCHIVE_DIR="/var/log/archive"
MAX_SIZE="100M"
RETENTION_DAYS=30

# Process each log file
for log_file in "$LOG_DIR"/*.log; do
filename=$(basename "$log_file")

# Check if file needs rotation
if [ -f "$log_file" ] && [ $(stat -f%z "$log_file" 2>/dev/null || stat -c%s "$log_file") -gt 104857600 ]; then
echo "Rotating $filename..."

# Split log file
split -b "$MAX_SIZE" "$log_file" "$ARCHIVE_DIR/${filename}_$(date +%Y%m%d)_part_"

# Compress split files
for chunk in "$ARCHIVE_DIR/${filename}_$(date +%Y%m%d)_part_"*; do
gzip "$chunk"
done

# Clear original log
> "$log_file"

# Create index file
echo "Log rotation completed for $filename on $(date)" > "$ARCHIVE_DIR/${filename}_$(date +%Y%m%d).index"
fi
done

# Clean old archives
find "$ARCHIVE_DIR" -name "*.gz" -mtime +$RETENTION_DAYS -delete
find "$ARCHIVE_DIR" -name "*.index" -mtime +$RETENTION_DAYS -delete

echo "Log rotation completed."

Data Processing Pipeline

#!/bin/bash
# Data processing pipeline with parallel processing

INPUT_FILE="large_dataset.csv"
OUTPUT_DIR="processed_chunks"
CHUNK_LINES=10000
MAX_PROCESSES=4

# Create output directory
mkdir -p "$OUTPUT_DIR"

# Split input file
echo "Splitting input file..."
split -l "$CHUNK_LINES" "$INPUT_FILE" "$OUTPUT_DIR/chunk_"

# Process chunks in parallel
echo "Processing chunks..."
counter=0
for chunk_file in "$OUTPUT_DIR"/chunk_*; do
# Limit concurrent processes
if [ $((counter % MAX_PROCESSES)) -eq 0 ]; then
wait # Wait for current batch to complete
fi

# Process chunk in background
{
echo "Processing $(basename "$chunk_file")"
python process_data.py "$chunk_file" "${chunk_file}_processed"
echo "Completed $(basename "$chunk_file")"
} &

((counter++))
done

# Wait for all processes to complete
wait

# Reassemble processed data
echo "Reassembling processed data..."
cat "$OUTPUT_DIR"/chunk_*_processed > "$OUTPUT_DIR/final_processed.csv"

# Clean up temporary chunks
rm "$OUTPUT_DIR"/chunk_*

echo "Processing pipeline completed: $OUTPUT_DIR/final_processed.csv"

Troubleshooting

Common Issues

Reassembly Problems

# Split's default suffixes (aa, ab, ...) sort correctly under plain glob expansion
cat part_* > reassembled.txt

# If suffix lengths vary (e.g. parts were renamed), use version sort to be safe
cat $(ls part_* | sort -V) > correct_order.txt

# Equivalent: natural version sort with ls
cat $(ls -v part_*) > version_order.txt

Space and Size Issues

# Note: split -b makes every piece exactly SIZE bytes (only the last is smaller)
split -b 1G exact_file.txt part_

# When writing pieces to removable media, leave headroom for filesystem overhead
split -b 900M exact_file.txt part_

# Problem: Disk space insufficient
# Solution: Split directly to different location
split -b 1G /source/large_file.txt /different/disk/prefix_

# Split and compress on the fly
split -b 100M --filter='gzip > $FILE.gz' large_file.txt part_

Permission and Access Issues

# Problem: Permission denied
# Solution: Check and fix permissions
ls -la /path/to/file
chmod 644 /path/to/file

# Split to directory with write permissions
split -b 1M large_file.txt /tmp/split_part_

# Problem: File locked by another process
# Solution: Wait or create copy
cp --reflink=auto /locked/file.txt /tmp/file_copy.txt
split -b 1M /tmp/file_copy.txt part_

Performance Issues

Slow Splitting Operations

# Problem: Splitting large files is slow
# Solution: Use larger chunks; splitting is I/O-bound, and fewer files mean less per-file overhead
split -b 1G huge_file.txt part_

# Solution: Use parallel processing for multiple files
for file in *.large; do
split -b 50M "$file" "${file}_" &
done
wait

# Solution: Write output to faster storage
split -b 1G /slow/storage/large_file.txt /fast/temp/part_

Memory Issues

# Problem: Out of memory in a processing pipeline
# Note: split itself streams its input and never loads the whole file into memory
# Solution: Pipe producers straight into split instead of buffering data elsewhere
mysqldump database | split -b 100M - dump_part_

# Solution: Split to disk directly rather than holding pieces in memory
split -b 100M large_file.txt /tmp/part_

Recombination and Verification

Standard Reassembly Methods

Basic Reassembly

# Reassemble alphabetical files
cat xaa xab xac > reassembled_file.txt

# Reassemble all alphabetical files
cat x* > reassembled_file.txt

# Reassemble numeric files
cat x00 x01 x02 > reassembled_file.txt

# Reassemble all numeric files
cat $(ls x* | sort -V) > reassembled_file.txt

Advanced Reassembly

# Reassemble with progress indication
total_files=$(ls part_* | wc -l)
current=0
> reassembled_file.txt
for file in $(ls part_* | sort -V); do
((current++))
echo "Processing file $current of $total_files: $file"
cat "$file" >> reassembled_file.txt
done

# Reassemble with checksum verification (original.md5 holds the checksum of the source file)
cat part_* > temp_reassembled.txt
if [ "$(md5sum < temp_reassembled.txt | awk '{print $1}')" = "$(awk '{print $1}' original.md5)" ]; then
mv temp_reassembled.txt final_file.txt
echo "✅ Reassembly successful"
else
echo "❌ Reassembly failed - checksum mismatch"
rm temp_reassembled.txt
fi

Verification and Validation

Integrity Checking

# Verify by line count
original_lines=$(wc -l < original_file.txt)
reassembled_lines=$(wc -l < reassembled_file.txt)
echo "Original lines: $original_lines, Reassembled lines: $reassembled_lines"

# Verify by byte count
original_bytes=$(wc -c < original_file.txt)
reassembled_bytes=$(wc -c < reassembled_file.txt)
echo "Original bytes: $original_bytes, Reassembled bytes: $reassembled_bytes"

# Verify by checksum
original_md5=$(md5sum < original_file.txt)
reassembled_md5=$(md5sum < reassembled_file.txt)
echo "Original MD5: $original_md5"
echo "Reassembled MD5: $reassembled_md5"

# Verify file content differences
diff original_file.txt reassembled_file.txt
if [ $? -eq 0 ]; then
echo "✅ Files are identical"
else
echo "❌ Files differ"
fi

Related Commands

  • cat - Concatenate files and print on standard output
  • csplit - Split files based on context
  • awk - Pattern scanning and processing language
  • tar - Tape archive utility
  • dd - Convert and copy a file
  • cut - Remove sections from each line of files
  • head - Output the first part of files
  • tail - Output the last part of files
  • wc - Print newline, word, and byte counts
  • sort - Sort lines of text files
  • join - Join lines of two files on a common field

Best Practices

  1. Choose appropriate split sizes based on your use case (email limits, media constraints, network bandwidth)
  2. Use meaningful prefixes for easy identification and organization of split files
  3. Generate checksums for each split file to ensure integrity during transfer and storage
  4. Document reassembly procedures and create scripts for automatic reassembly
  5. Consider compression before splitting for space efficiency, especially for text files
  6. Use numeric suffixes when expecting large numbers of split files to avoid naming conflicts
  7. Test the reassembly process before deleting original files to ensure splitting was successful (see the check after this list)
  8. Use parallel processing for handling multiple large files or large datasets
  9. Monitor disk space during splitting operations to avoid running out of storage
  10. Keep split files organized in dedicated directories with clear naming conventions

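A quick way to apply practice 7 is to compare the concatenated pieces against the original before removing it; a minimal check with cmp:

# Verify the pieces reproduce the original byte-for-byte before deleting it
split -b 1G big_archive.tar.gz part_
cat part_* | cmp - big_archive.tar.gz && echo "✅ Safe to delete the original"
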
Performance Tips

  1. Chunk size is a trade-off - larger chunks mean fewer files and less per-file overhead, smaller chunks allow finer-grained parallelism
  2. Use streaming with pipelines to avoid loading entire files into memory
  3. Parallel processing can significantly speed up handling of multiple files
  4. Writing output to fast storage improves I/O performance
  5. Compression before splitting reduces I/O and storage requirements
  6. Chunk size tuning can optimize performance for specific file types and transfer methods
  7. SSD storage dramatically improves splitting performance for large files
  8. Network considerations - split sizes should match bandwidth and latency requirements

Security Considerations

  1. Sensitive data should be encrypted before splitting for transfer (see the example after this list)
  2. Checksum verification ensures files haven't been tampered with during transfer
  3. Secure temporary files when splitting sensitive information
  4. Access permissions should be set appropriately for split files
  5. Audit trail should be maintained for sensitive file operations
  6. Clean up temporary split files containing sensitive data
  7. Secure deletion of original files after successful splitting and verification

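A minimal sketch of practice 1, assuming symmetric GPG encryption is acceptable for the workflow:

# Encrypt first, then split the ciphertext into transfer-sized pieces
gpg --symmetric --output backup.tar.gz.gpg backup.tar.gz
split -b 100M backup.tar.gz.gpg secure_part_

# The recipient reassembles and decrypts
cat secure_part_* > backup.tar.gz.gpg
gpg --decrypt --output backup.tar.gz backup.tar.gz.gpg
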
The split command is an essential utility for managing large files and datasets. Whether for backup operations, data processing, file transfer, or overcoming size limitations, it provides reliable and flexible file splitting capabilities that integrate seamlessly with the Unix toolchain. Its support for various splitting criteria, flexible naming conventions, and pipeline integration makes it an indispensable tool for system administrators, developers, and data processors.