Building an Autonomous Epigenetic CRISPR-Targeting Pipeline | Epigenetic CRISPR Pipeline - Dataspheres AI

Building an Autonomous Epigenetic CRISPR-Targeting Pipeline May 25, 2026 · facelessaicoder This isn't a polished retrospective. It's a live engineering log...

Building an Autonomous Epigenetic CRISPR-Targeting Pipeline May 25, 2026 · facelessaicoder This isn't a polished retrospective. It's a live engineering log from the middle of a hard problem — an autonomous pipeline that ingests whole-genome bisulfite sequencing data, calls methylation-aware SNPs against a gold-standard benchmark, designs CRISPR guide RNAs targeting epigenetically active regions, and validates structural binding impact using AlphaFold2 and delta-delta-G estimates. Three tiers of biology, each with its own dependency nightmare. We're not done. The pipeline is running right now, aligning ~57GB of compressed FASTQ reads to GRCh38 while you read this. The architecture is solid, the math is understood, and the path to F1 > 0.97 is clear. Here's everything — the raw setup pain, what the tools actually do, the data complexity, the math, and where we stand on the board. Part I: The Setup Gauntlet The Environment Problem This project runs on Windows 11 with WSL2 (Ubuntu), and that combination introduces a class of infrastructure problems that have nothing to do with bioinformatics. The filesystem split. WSL2 runs a real Linux kernel in a lightweight Hyper-V VM. It has two distinct storage domains: the Windows NTFS filesystem (accessible from Linux at /mnt/c/ ) via the 9P network protocol, and the native Linux ext4 virtual disk. The 9P protocol works fine for small files. For large files — like 57GB of gzip-compressed FASTQ data read at full alignment bandwidth — it becomes catastrophically unstable. We hit this hard: biscuit would stall mid-alignment, the 9P session would crash with E_UNEXPECTED , and the entire pipeline would freeze with no error message. The fix was surgical: get all FASTQs onto the Linux ext4 filesystem before touching them. But copying 57GB from Windows NTFS to Linux ext4 has its own trap — you can't use cp from inside WSL2 at full bandwidth without triggering the same 9P instability. The solution: copy from the Windows side using PowerShell's Copy-Item , targeting the WSL filesystem via UNC path ( \\wsl$\Ubuntu\home\facel\fastq ). This bypasses 9P entirely and uses the Hyper-V integration layer directly. 185GB moved cleanly. The OOM kill. With all data on ext4, the pipeline launched and ran for 45 minutes before the Linux OOM killer terminated biscuit mid-alignment. The root cause: GRCh38 is a 3.1GB FASTA file but requires a ~12GB in-memory index at alignment time. Add 12 concurrent read-processing threads each buffering ~1.5GB of alignment state, and biscuit alone was consuming ~30GB RSS on a system with 32GB WSL2 RAM. The samtools sort process receiving the pipe added another 3–4GB. The system ran out. The fix: reduce biscuit threads from 12 to 6 (peak RSS drops to ~18GB) and add -m 1G to all samtools sort calls to cap per-thread sort buffers. Total pipeline memory now peaks around 22GB — safe margin. The dependency wall. Tier 3 requires two academic-licensed packages: FoldX (delta-delta-G calculations) and PyRosetta (protein energy minimization). Both require manual institutional license applications. No mocks, no workarounds — those are project rules. The WSL2 Filesystem Architecture |PowerShell Copy-Item\nbypasses 9P| EXT4 WIN --> NTFS_MOUNT WSL --> EXT4 WSL --> NTFS_MOUNT EXT4 --> FASTQ_EXT4 NTFS_MOUNT --> REF NTFS_MOUNT --> RESULTS"> The Data Downloads Getting the raw data onto disk required the NCBI SRA Toolkit ( prefetch + fasterq-dump ) to download three sequencing runs of NA12878. This took 8+ hours across multiple sessions with checkpointing: Accession Compressed Size Coverage Contribution SRR949193 ~130GB uncompressed ~11x SRR949194 ~29GB compressed ~11x SRR949195 ~28GB compressed ~11x Combined: ~33x depth on chr22. The GRCh38 reference adds 3.1GB, the GIAB benchmark VCF/BED adds ~500MB, and the rtg SDF adds ~800MB. Total data on disk: ~185GB. None of it fits in RAM. Part II: The Data at Genome Scale What's in a FASTQ? Every sequencing read is stored in 4-line FASTQ format — read ID, nucleotide sequence (100–150 bp), separator, and Phred quality scores encoded as ASCII. For whole-genome bisulfite sequencing (WGBS), the reads have been chemically treated before sequencing, which is where the real complexity begins. Bisulfite Conversion Chemistry Bisulfite treatment converts unmethylated cytosine (C) to uracil (U → T after PCR), while methylated 5mC remains as C. This C→T asymmetry is what standard aligners misread as SNPs. Standard aligners (BWA, Bowtie2) would interpret the C→T conversions from bisulfite treatment as millions of SNPs. biscuit is purpose-built for WGBS: it maintains two strand-specific reference copies of the genome and performs bisulfite-aware alignment, tagging each read with methylation information via ZC and ZR BAM tags. The mathematical consequence is that bisulfite sequencing reduces effective sequence complexity — a genomic C that's unmethylated becomes informationally equivalent to T, increasing multi-mapping and requiring higher coverage