Structural Family Classification app icon

Tool · Desktop · Windows · Experimental

Structural Family Classification

结构家族分类

Classify SELEX-derived DNA aptamers into structural families by their loop region, not their full-length sequence. Multi-structure suboptimal folding plus weak-helix dissolution lets two aptamers with different stems but the same loop architecture land in the same family.

面向 SELEX 适配体的环路结构家族分类——按环路区域而非全长序列分组。多结构次优折叠 + 弱螺旋融合,使茎区不同但环路结构一致的适配体落入同一家族。

v0.9.4 · Built on AptaFold + Aptamer Analysis · Loop families still under wet-lab validation — treat output as a structure-aware shortlist.

/ Introduction · 介绍

What it does.这个工具能做什么。

Conventional clustering of post-SELEX sequence pools matches aptamers by full-length string similarity — which conflates "same fold, different stem" cases. This pipeline switches the comparison to the loop region after weak-helix dissolution: every sequence is folded into up to 25 suboptimal structures, the loops are extracted from each structure, and helices with ≤ 3 bp are dissolved so flanking loops merge. The merged loops are then grouped by a purpose-built, from-scratch conserved-motif classifiernot a reuse of the Aptamer Analysis Tool's k-mer / DBSCAN engine. Instead of measuring whole-loop string distance (which dilutes a short conserved core inside a long variable loop), it discovers the conserved blocks a loop carries and groups loops that share the same block combination. There is no N×N distance matrix, so the classifier has no corpus-size ceiling and runs in seconds on a full pool.

传统 SELEX 后聚类按全长序列相似度匹配适配体——会把"折叠相同、茎区不同"的情况混淆。本流水线将比较对象切换为弱螺旋融合后的环路区域:每条序列产生最多 25 个次优结构,从每个结构中提取环路,≤ 3 bp 的弱螺旋被融合使相邻环路合并。随后用一套完全自主重写的保守 motif 分类引擎对融合后的环路分组——并非复用 Aptamer Analysis Tool 的 k-mer / DBSCAN。它不再测量整条环路的字符串距离(那样会让长可变环路里一小段保守核心被稀释),而是发现每条环路携带的保守区块(block),把拥有相同区块组合的环路归为一族。整个过程没有 N×N 距离矩阵,因此不受语料规模上限限制,全库分类只需数秒。

/ Pipeline · 流水线

15 000 trimmed sequences
        │
        ▼  Fold  —  reuses AptaFold's seqfold / ViennaRNA suboptimals
   02_fold_predictions.csv
        │
        ▼  Loop extraction  (self-written)
   03_raw_loops.csv           dot-bracket → pair table → loops
   04_merged_loops_bp3.csv    weak-helix dissolution (≤ 3 bp)
   05_loop_map_bp3.csv        loop ↔ (seq, fold, span, ΔG)
   06_loop_corpus_bp3.fasta   de-duped loop strings + recurrence
        │
        ▼  Motif-block classifier  (self-written — the core engine)
   ·  mine enriched k-mer cores → grow consensus blocks
   ·  family = the block COMBINATION a loop carries
   ·  strict full-consensus membership; no N×N matrix
        │
        ▼  Assign + post-process  (self-written)
   ·  dual-anchor merge   — fuse families sharing a 5′ AND 3′ framework
   ·  ungrouped rescue    — recover high-read near-identical winners
   07_aptamer_family_bp3.csv        per-aptamer family
   08_family_representatives_bp3.csv  picks for validation

/ Usage · 使用方式

How to use it.使用流程。

The whole pipeline is wrapped in one desktop GUI (启动GUI.bat on Windows). Folding is launched as a separate subprocess so the GUI stays responsive throughout.

整套流水线封装在一个桌面 GUI 中(Windows 上 启动GUI.bat 一键启动)。折叠任务在独立子进程中运行,主界面始终保持响应。

  1. 01

    Launch & load sequencing data

    启动并导入测序数据

    Open the GUI. Pick a FASTQ file (will be trimmed) or an already-trimmed FASTA / CSV from Aptamer Analysis Tool. Read counts are preserved if the FASTA headers carry count=N.

    打开 GUI,导入 FASTQ 文件(自动 trim)或 Aptamer Analysis Tool 已 trim 的 FASTA / CSV。如 FASTA 头部含 count=N,read 数会被保留。

    Structural Family Classification main window with parameter panels
    Fig. 01 — Main window · 主窗口(导入 / Trim / 折叠 / 家族分类四组参数)
  2. 02

    Configure trim, folding & family parameters

    设置 trim、折叠与家族参数

    In the four parameter panels: pick a trim mode and library length N; pick a folding engine (viennarna recommended, salt-sensitive, ~20–30× faster than alternatives); set [Na⁺] / [Mg²⁺] / [K⁺] / [Ca²⁺] and temperature to match your selection buffer; set the loop merge threshold bp (auto by default), target family count, and motif strictness.

    在四组参数面板中:选择 trim 模式与文库长度 N;选择折叠引擎(推荐 viennarna,盐离子敏感,比其他引擎快约 20–30 倍);设置 [Na⁺] / [Mg²⁺] / [K⁺] / [Ca²⁺] 与温度以匹配筛选缓冲液;设置环路融合阈值 bp(默认 auto)、目标家族数量、motif 严格度。

  3. 03

    Run the pipeline

    运行流水线

    Folding (Stage 1/4) is the slow part — every sequence × fold_budget140 constrained refolds, run in parallel across all CPU cores. Stages 2/4 (loop extraction + merge), 3/4 (motif-block classification), and 4/4 (assign + dual-anchor merge + rescue) each finish in seconds once folding is cached. The bottom-left console reports each stage's row counts and the run's headline stats (aptamers, total folds, loop corpus size, families, grouped fraction, representatives).

    折叠(阶段 1/4)是最慢的一步——每条序列 × fold_budget(约 140 次受限重折),跨所有 CPU 核并行。一旦折叠结果缓存,阶段 2/4(环路提取与融合)、3/4(motif 区块分类)、4/4(归属 + 双锚点合并 + 救回)均秒级完成。左下角控制台逐阶段汇报行数与运行统计(aptamers、total folds、loop corpus、families、grouped 比例、代表序列数)。

    Loop pipeline GUI after running on a real Acet SELEX FASTQ: 4111 aptamers, 20555 folds, 12542 unique loops, 82 families, 1581 aptamers grouped, first family preview visible
    Fig. 03 — Pipeline complete · 流水线完成(4111 aptamers · 82 families · 1581 grouped · 234 reps)
  4. 04

    Review families & iterate

    查看家族结果并迭代

    The result panel renders each family in a distinct colour: highlighted bases = the family's conserved motif, gray = primer flanks, white = variable N region. Each chain card also draws the full-insert 2D secondary structure (dissolved at the merge threshold), with the conserved motif painted in the family colour. Tweak the merge bp, family count (min-family), motif-mismatch tolerance, or motif-extension cutoff and click Re-classify — the fold cache is reused, so a new family palette appears in seconds.

    结果面板按家族用不同颜色显示:高亮碱基 = 该家族保守 motif,灰色 = 引物恒定区,白底 = N 区可变部分。每条 chain 卡片同时画出完整插入序列的 2D 二级结构(按融合阈值溶解),保守 motif 用家族颜色高亮。调整融合 bp、家族数量(min-family)、motif 错配容忍度或 motif 延伸阈值后点击 重新分类,复用折叠缓存,秒级出新结果。

    Result panel showing FAM_CATAACTC family (130 chains) with the conserved CATAACTC motif highlighted in orange across multiple chains, each with its own 2D fold cartoon Result panel scrolled to a third family with conserved CAGATCTTT / CAGAAAAT / CAGAGCAA motif variants highlighted in purple
    Fig. 04 — Family preview · 家族预览(橙色 motif = FAM_CATAACTC;紫色 motif = 另一家族;每行右侧 = 完整 2D 折叠)
  5. 05

    Export representatives for wet-lab

    导出家族代表序列做湿实验

    07_aptamer_family_bp3.csv assigns every aptamer in the pool to a family. 08_family_representatives_bp3.csv picks one representative per family (ranked by read count) — the shortlist you take into CD / ITC / FRET validation.

    07_aptamer_family_bp3.csv 给文库中每一条适配体分配家族归属。08_family_representatives_bp3.csv 按 read 数排序为每个家族选一条代表序列——可直接进入 CD / ITC / FRET 验证的候选清单。

/ Disclaimer · 声明

Things to know.使用前请阅读。

  1. Experimental — a preview build, not yet validated.

    实验性预览版,尚未完成校准

    Loop-centered family assignment is still an engineering hypothesis. Wet-lab validation (CD melting + ITC across representatives of every family from a real SELEX pool) is ongoing. The build above (v0.9.4) is a usable preview so you can try the pipeline on your own pools — but treat its families as a structure-aware shortlist for triage, not a calibrated, final answer.

    环路中心的家族归属目前仍是工程假设。基于真实 SELEX 文库每个家族代表的 CD 熔解 + ITC 验证正在进行中。上方的 v0.9.4 是可用的预览版,方便你在自己的文库上试跑流程——但请把它给出的家族视为结构层面的初筛候选,而非已校准的最终结论。

  2. Folding inherits AptaFold's limits; the classifier is new.

    折叠继承 AptaFold 的限制;分类引擎是全新的

    The folding stage reuses AptaFold, so all of AptaFold's accuracy caveats (uncalibrated divalent-cation treatment, experimental engines) apply to the structures this tool classifies. The motif-block classifier, by contrast, is written from scratch for this pipeline — it is not the Aptamer Analysis Tool's k-mer / DBSCAN engine, and its loop-family assignments are themselves the experimental part still under validation.

    折叠阶段复用 AptaFold,因此 AptaFold 的全部精度说明(未校准的二价阳离子处理、实验性引擎)对本工具所分类的结构同样适用。而 motif 区块分类引擎是为本流水线从零编写的——它不是 Aptamer Analysis Tool 的 k-mer / DBSCAN,其环路家族归属本身正是仍在验证中的实验部分。

  3. Local processing, no telemetry.

    本地处理,无遥测

    No network calls during the pipeline. Sequences, folds, loops, clusters, and family assignments all live on your machine.

    流水线运行期间不发起任何网络请求。序列、折叠、环路、聚类与家族归属全部保留在本机。

  4. Will be free for academic use when released.

    发布后面向学术免费

    Free for academic / non-commercial use, on the same terms as the other Liu Lab tools (MIT-style permissive license). Please point colleagues to this page rather than redistributing the EXE, so they always get the latest preview as the pipeline is calibrated.

    学术 / 非商业用途免费,与 Liu Lab 其他工具一致(MIT 风格的宽松许可证)。请把同事指向本页而不是直接转发 EXE,以便他们随流水线校准始终拿到最新预览版。

  5. Provided as-is, without warranty.

    不提供任何形式的担保

    Family assignments are best-effort. Treat them as a structure-aware shortlist for wet-lab triage, not as final answers. Bug reports + early collaboration inquiries: [email protected].

    家族归属尽力而为,请将其视为结构层面的湿实验候选清单,而非最终结论。Bug 反馈与早期合作意向请发送至 [email protected]

/ Acknowledgments · 致谢

Built with the lab.和实验室一起完成。

Developed in the Bionanotechnology & Interfaces Laboratory at the University of Waterloo, led by Prof. Juewen Liu. This pipeline is the natural sequel to AptaFold and the Aptamer Analysis Tool — taking the question "what families are in a SELEX pool?" from sequence-level matching to structure-level matching.

本工具在加拿大滑铁卢大学刘珏文教授的 Bionanotechnology & Interfaces Laboratory 完成。它是 AptaFold 与 Aptamer Analysis Tool 的自然延续——把"SELEX 文库里有哪些家族?"这个问题从序列层面的匹配推进到结构层面的匹配。

What is reused, and what is new. Only the folding stage is borrowed. Everything downstream — loop extraction, weak-helix dissolution, the motif-block classifier, and the dual-anchor / rescue post-processing — was written from scratch for this pipeline.

哪些是复用、哪些是全新。只有折叠阶段是借用的。其后的全部环节——环路提取、弱螺旋融合、motif 区块分类器,以及双锚点合并 / 救回后处理——都是为本流水线从零编写的。

  1. Folding — reused from AptaFold

    折叠阶段(复用 AptaFold)

    Suboptimal folding via AptaFold's enumerate_seqfold_suboptimals() — up to 25 structures per sequence within 4 kcal/mol of the MFE, ranked by ΔG. This is the only borrowed stage; the classifier never recomputes a fold.

    复用 AptaFold 的 enumerate_seqfold_suboptimals() 提供次优折叠——每条序列在 MFE 上下 4 kcal/mol 内最多 25 个结构,按 ΔG 排序。这是唯一借用的阶段;分类器不重做折叠。

  2. Motif-block classifier — written from scratch

    motif 区块分类器(完全自写 · 核心创新)

    The core engine is original to this project, not the Aptamer Analysis Tool's k-mer / DBSCAN. It mines enriched k-mer cores, grows each into a consensus block, and names a family by the block combination a loop carries — with strict full-consensus membership and no N×N distance matrix. Two original post-processing passes follow: a dual-anchor merge (fuses families that share both a 5′ and a 3′ framework with a variable middle) and an ungrouped rescue (recovers near-identical high-read winners that fall below the motif-seeding threshold).

    核心引擎为本项目原创,并非 Aptamer Analysis Tool 的 k-mer / DBSCAN。它挖掘富集的 k-mer 核心、生长为共识区块,以一条环路携带的区块组合命名家族——采用严格的完整共识成员判定,且无 N×N 距离矩阵。其后还有两个原创后处理:双锚点合并(融合共享 5′ 与 3′ 框架、仅中间可变的家族)与未分组救回(救回低于 motif 播种阈值的近乎相同的高 reads 优势序列)。

  3. seqfold & ViennaRNA

    底层折叠库

    The two folding engines underneath AptaFold. ViennaRNA is the default in this pipeline for salt sensitivity and runtime; seqfold is the pure-Python fallback when ViennaRNA is unavailable.

    AptaFold 背后两套折叠引擎。本流水线默认使用 ViennaRNA(盐敏感、速度快);当 ViennaRNA 不可用时回退到纯 Python 的 seqfold。

Open-source libraries. The pipeline runs on top of these projects — thanks to their maintainers:

开源依赖。本流水线建立在以下项目之上,感谢各位维护者:

The motif-block classifier itself uses only the Python standard library — no clustering dependency. · 分类引擎本身仅依赖 Python 标准库,无聚类库依赖。

/ Also in the toolkit · 工具集

Other tools.同系列工具。