# GGF: Graphical Genome Fragments Format 2.0 Specification

## Introduction

GGF (Graphical Genome Fragments) is a format for representing variation graphs at every scale, from population-level data such as a population reference genome to personal-genome data such as genotypes, annotations, structural variants, and read alignments, using the concept of sequence graphs.

GGF is an extension of the Graphical Fragment Assembly 2.0 (GFA 2.0) format, which describes genome graphs and assembly graphs in particular. Backward compatibility with GFA 2.0 is fully guaranteed.
To handle the challenges described in the Background section, several new line types are introduced or ported from GFA 1.

Here we describe the grammar, semantics, examples, and use cases of the GGF format.

## Background (Tentative)

The advent of next-generation sequencers underscores the importance of a data format that can handle a large number of genomes with variations at different scales. A common format for describing variations is needed to store data with fewer resources and to compare data from multiple sources across projects and sequencing platforms.

VCF and GVF were proposed to describe SNPs, structural variations, and ploidy, but they are often too complicated for describing and interpreting nested insertions, tandem duplications, and genotypes in complex SVs. Moreover, VCF cannot describe structural variations adequately without optional tags.

Neither non-standardized tabular text files nor "diff-styled" variant files are sufficient to describe complex structural variations or genotyping. Graph-based representations of genomes have therefore been considered promising, especially in human variation analysis.

Graph formats designed for describing assembly graphs, such as GFA, ASQG, and FASTG, are not well suited to variation graphs, because they can only express small variations or haplotypes in a redundant way.

GGF is designed to describe genotypes and SNPs on the genome graph in a simpler way, so that files can be parsed with your own code and handled with traditional Unix commands such as sed or awk. This data structure may be efficient for interpretation through visualization, personal genome analysis, and population variation analysis such as GWAS on the genome graph.

## Grammar
```
<spec>        <-  <GGF header> { <external:GFA2 grammar> | <link:GFA1> | <genotype> | <variants> | <annotations> }+

<GGF header>  <- # GGF {VN:Z:2.0} <tag>*

<link:GFA1>   <- L {sid1:ref} {sid2:ref} (* | 0M)

<genotype>    <- { <o_genotype> | <u_genotype> | <s_genotype> <genotype> <e_genotype> }+

    <o_genotype> <- W <name:id> <sid:id> <len:opt_pos> <beg:pos> <end:pos> <haplotype>
    <u_genotype> <- w <name:id> <sid:id> <len:opt_pos> <beg:pos> <end:pos> <haplotype>

    <s_genotype> <- [ <name:id> <repeat_num:int> <repeat_lb:int> <repeat_ub:int> <repeat_interval:int> <tag>*
    <e_genotype> <- ]

<variants>    <- V <sid:id> <beg:pos> (* | <length:pos>) <alt:sequence> <tag>*

<annotations> <- A <annot_id:id> <sid:id> <len:opt_pos> <beg:pos> <end:pos> <type:[!-~]+> <tag>*

    <opt_pos>    <- <pos> | * 

    <haplotype>  <- { A | T | G | C | - | * | \[ <haplotype> \] }+

    <flag>       <- <flag_str>(,<flag_str>)*

        <flag_str>       <- { M | U | 2 | S | [a-z]+ }*

    <segment_id> <- [1-9][0-9]*
```
In the grammar above, all symbols are literals other than tokens between <>, the derivation operator <-, and the following marks:

* {} enclose an optional item
* | denotes an alternative
* \* denotes zero-or-more
* + denotes one-or-more
* [] enclose a set of one-character alternatives
* \\[ and \\] denote the literal characters [ and ]

Like GFA 2.0, GGF is tab-delimited: every lexical token is separated from the next by a single tab.
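As a minimal sketch of the tab-delimited rule, the following Python splits each line into a record type and its fields. The function names `tokenize` and `read_records` are illustrative only and do not belong to any existing GGF library.

```python
def tokenize(line):
    """Split one tab-delimited line into (record_type, fields)."""
    tokens = line.rstrip("\n").split("\t")
    return tokens[0], tokens[1:]


def read_records(lines):
    """Yield (record_type, fields) for every non-empty line."""
    for line in lines:
        if line.strip():
            yield tokenize(line)


# Example input reusing the S- and L-lines shown later in this document.
example = ["S\t11\tACCTT\n", "S\t12\tTCAAGG\n", "\n", "L\t11\t+\t12\t-\t0M\n"]
records = list(read_records(example))
```

A real reader would dispatch on the record type (`S`, `L`, `W`, `w`, `V`, `A`, and so on) after this step.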

## Semantics

### GGF Header(#)

The **GGF header** should be the first line of the file, before all other lines.
The header contains an optional 'VN' SAM-style tag giving the version number, 2.0. If a GGF file is loaded as GFA 2.0, this line and its tags are ignored.

### Segment(S)

To distinguish segment names from path names, a segment name (id) must be a positive integer without leading zeros:

```
 <sid> <- [1-9][0-9]*
```

It is compatible with gfakluge, which is used in VG.
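A short sketch of how the segment id rule could be checked, using the pattern `[1-9][0-9]*` from the grammar. The helper name `is_valid_sid` is hypothetical.

```python
import re

# Pattern taken from the <sid> rule above: a positive integer, no leading zero.
SID_PATTERN = re.compile(r"[1-9][0-9]*")


def is_valid_sid(name):
    """Return True if the whole string is a valid GGF segment id."""
    return SID_PATTERN.fullmatch(name) is not None
```

With this rule, `"11"` is accepted, while `"0"`, `"012"`, and `"chr1"` are rejected, so a non-numeric name can be assumed not to be a segment.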

### Link(L)

Since the **Edge** line of GFA 2.0 is too complicated for describing a simple end-to-end connection between segments, **Link** is ported from GFA 1.0 with the constraint that the CIGAR is 0M (or `*`, as the grammar allows). This means that overlaps between segments are prohibited.

Example:
```
S	11	ACCTT
S	12	TCAAGG
L	11	+	12	-	0M
```
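Under GFA 1 orientation semantics, segment 12 is traversed in reverse (`-`), so the spelled sequence is ACCTT followed by the reverse complement of TCAAGG, that is "ACCTTCCTTGA". With `12 +` instead, the sequence would be the plain concatenation "ACCTTTCAAGG".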
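The following is a minimal sketch of how a 0M link spells a sequence, assuming GFA 1 orientation semantics in which `-` means the reverse complement of the segment. The helper names `oriented` and `spell_link` are illustrative.

```python
# Translation table for complementing DNA bases.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def oriented(seq, orient):
    """Return seq as-is for '+', or its reverse complement for '-'."""
    if orient == "+":
        return seq
    if orient == "-":
        return seq.translate(COMPLEMENT)[::-1]
    raise ValueError(f"unknown orientation: {orient!r}")


def spell_link(segments, sid1, orient1, sid2, orient2):
    """Concatenate two oriented segments joined by a 0M (no-overlap) link."""
    return oriented(segments[sid1], orient1) + oriented(segments[sid2], orient2)


segments = {"11": "ACCTT", "12": "TCAAGG"}
forward = spell_link(segments, "11", "+", "12", "+")   # plain concatenation
reversed_ = spell_link(segments, "11", "+", "12", "-")  # 12 is reverse-complemented
```

Because the overlap is fixed at 0M, no bases are trimmed at the junction, which keeps the spelling a simple concatenation.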

### Genotype(W/w named after walk)

A **Genotype**, encoded on a W- or w-line, names and specifies a genotype in the graph. W-lines encode *ordered* collections such as paths, and w-lines encode *unordered* genotypes, e.g. when the order of repeats in a telomere is uncertain. The remainder of the line consists of a required ID for the collection, followed by a segment ID with an orientation, an optional length, a start position, an end position, and the haplotype defined in the next paragraph.

A haplotype is described as a sequence of alleles in the order of the V-lines, and only variants described in the file are permitted. When an allele is a single-nucleotide polymorphism, the alternative nucleotide is written as a single character: A, C, G, T, - (hyphen, deletion), or * (no data). An unphased allele is permitted within a phased region. For short indels, the alternative sequence must be enclosed in square brackets. The order of the characters and bracketed sequences corresponds to the order of the V-lines on the same segment.

Since the multi-line path format was adopted, it is not an error for W- and w-lines to have the same name. W- and w-lines can share IDs, and lines sharing the same ID are regarded as the same genotype. The order of w-lines with the same ID is ignored, whereas the order of W-lines with the same ID corresponds to the order of segments in the path. A genotype ID may therefore be shared between W- and w-lines, and only the order of consecutive w-lines is ignored.

This relates to the discussion of whether U/O-lines with the same name can be considered concatenated together in the order in which they appear (see #54 and #47).

Example:
```
S    2    ACGTGTAAACCCT
V    2    2    1    A
V    2    4    1    C
V    2    4    1    AG
W    1    2+    13    0   13   AC   ->  ACATCTAAACCCT
W    2    2+    13    0   13   GC   ->  ACGTCTAAACCCT
W    3    2+    13    0   13   G[AG]   ->  ACGTAGTAAACCCT
```
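The example above can be reproduced with the following sketch. It rests on several assumptions that this specification does not state explicitly: each allele in the haplotype corresponds to one distinct variant site (offset and length) rather than to one V-line, sites do not overlap, brackets are not nested, and `*` (no data) keeps the reference bases. The function names are hypothetical.

```python
def parse_haplotype(hap):
    """Split a haplotype string into alleles; '[...]' is one multi-base allele."""
    alleles, i = [], 0
    while i < len(hap):
        if hap[i] == "[":
            j = hap.index("]", i)          # flat brackets only (assumption)
            alleles.append(hap[i + 1:j])
            i = j + 1
        else:
            alleles.append(hap[i])
            i += 1
    return alleles


def apply_haplotype(seq, variants, hap):
    """Spell the sequence of a segment for one haplotype.

    variants: list of (offset, length, alt) tuples taken from V-lines.
    """
    # One allele per distinct site, ordered by offset (assumption).
    sites = sorted({(offset, length) for offset, length, _alt in variants})
    alleles = parse_haplotype(hap)
    if len(alleles) != len(sites):
        raise ValueError("haplotype length does not match the number of sites")
    out, cursor = [], 0
    for (offset, length), allele in zip(sites, alleles):
        out.append(seq[cursor:offset])             # unchanged bases before the site
        if allele == "*":
            out.append(seq[offset:offset + length])  # no data: keep reference
        elif allele != "-":
            out.append(allele)                       # '-' deletes the site
        cursor = offset + length
    out.append(seq[cursor:])
    return "".join(out)


segment = "ACGTGTAAACCCT"
v_lines = [(2, 1, "A"), (4, 1, "C"), (4, 1, "AG")]
spelled = [apply_haplotype(segment, v_lines, h) for h in ("AC", "GC", "G[AG]")]
```

The three results match the sequences shown after `->` in the example. A stricter implementation could also check that each allele is either the reference base or one of the alternatives listed on the V-lines.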

#### Repeated Segments ([, ])

Lines enclosed between a [-line and a ]-line are annotated as a repeat region. A [-line without a matching ]-line, or vice versa, is not acceptable. Nested repeats, on the other hand, are acceptable. The remainder of the [-line consists of the number of enclosed lines, the lower bound on the number of lines (as observed in actual data), the upper bound, and the interval.

Example:
```
W    1    1+    100    5    100    ATTTC-*AT
[    6    4    8    2
w    1    2+    100    0    100    ATGTGT
w    1    3+    100    0    100    TGTGT
w    1    2+    100    0    100    ATGTGT
w    1    3+    100    0    100    TGTGT
w    1    2+    100    0    100    TGTGT 
w    1    3+    100    0    100    TGTGT
]    
W    1    4+    100    0    70     C-AT**AAG
```

This indicates that the unit \[2+, 3+\] (the first 2 segments are selected because the interval is 2) is repeated a minimum of 2 times and a maximum of 4 times.
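The arithmetic behind this interpretation can be sketched as follows. The sketch reads the four integer fields as they appear in the example (enclosed lines, lower bound, upper bound, interval) and divides each line count by the interval. The `observed` value is an inference from the number of enclosed lines, and the name `repeat_bounds` is hypothetical.

```python
def repeat_bounds(line):
    """Convert the line counts on a [-line into repeat counts of the unit."""
    fields = line.split()
    if fields[0] != "[":
        raise ValueError("not a [-line")
    enclosed, lower, upper, interval = (int(f) for f in fields[1:5])
    return {
        "unit_lines": interval,            # lines forming one repeat unit
        "observed": enclosed // interval,  # copies written in the file
        "min": lower // interval,
        "max": upper // interval,
    }


bounds = repeat_bounds("[    6    4    8    2")
```

For the example, the six enclosed lines form three copies of a two-line unit, with between 2 and 4 copies allowed.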

### Variants(V)

**Variants**, if present, are encoded on V-lines that give alternative sequences within a segment. They are used when SNPs or short indels are too short to justify splitting segments on the graph.
A V-line consists of the segment ID, the offset described below, the length, and the alternative sequence. It resembles VCF, but the definition of position differs. Optional tags can store the allele frequency or annotations.

The offset is 0-based and counts from the start of the segment: 

```
//     seq+        G A T T A C A
//     offset+  → 0 1 2 3 4 5 6 7
```

Example: 

```
S    2    ACGTCT
V    2    2    3    A    
V    2    2    3    AAA
V    2    2    3    AAAAA
```

This indicates that there are 3 alternative sequences for segment 2: "ACAT", "ACAAAT", and "ACAAAAAT".
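As a minimal sketch, a single V-line can be applied by replacing `length` bases starting at the 0-based `offset` with the alternative sequence. The function name `apply_variant` is illustrative, and a `*` length is not handled here.

```python
def apply_variant(seq, offset, length, alt):
    """Replace seq[offset:offset+length] with alt (0-based offset)."""
    if offset < 0 or offset + length > len(seq):
        raise ValueError("variant lies outside the segment")
    return seq[:offset] + alt + seq[offset + length:]


segment = "ACGTCT"
# (offset, length, alt) for the three V-lines in the example above.
v_lines = [(2, 3, "A"), (2, 3, "AAA"), (2, 3, "AAAAA")]
alternatives = [apply_variant(segment, off, length, alt) for off, length, alt in v_lines]
```

Each V-line replaces the reference bases GTC at offsets 2 to 4, leaving the flanking AC and T untouched.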

### Annotations(A)

While an assembly graph does not need annotations such as genes, exons, or structural variations, a variation graph requires annotations like those described in GFF or GTF format.

Examples:
```
A    1    1+    100    20    100    gene    p53
A    1    2+    100    0    100    gene
```

The first line of a series of annotations can store supplementary information, such as a gene name, in the optional tags, and the remaining lines can omit it.

### Alignments($ named after synteny)

At first, we planned to add the alignment syntax below, but it has been replaced by the existing fragment and edge lines of GFA 2.

```
<alignments>  <- $ <name:id> <flag> <sid:id> <s_len:opt_pos> <s_beg:pos> <s_end:pos> <q_len:opt_pos> <q_beg:pos> <q_end1:pos> <alignment>
```

If you have a raw read, you can describe it on an F-line.

```
F    1    read+    0    10    0    10    10M     
```

If you have synteny information between segments, you can describe it on an E-line, but this does not indicate that there is a link between them. (This differs from the GFA 2.0 definition.)

```
E    1+    2+    0    10    2    12    12M
```

## Use cases

### To represent a topology of genome graph.
Use S- and L- lines.

### To represent SNPs in graph genomes 
Use V-lines. Use W- and w-lines if you have phased data.

### To represent haplotypes in multiploid genome or cancer genome
Use W- and w- lines.

### To enumerate all haplotypes.
Parse W- and w- lines.

### To query haplotypes in imputation.
Parse W- and w-lines, extract the alleles, construct a haplotype graph, and import it into XG.

### To add annotation to genome graph.
Use A- lines.

### To add synteny information between segments.
Use E- lines.

### To add alignment of reads.
Use F- lines.

## Q&A
### Q. Is a subgraph permitted?
A. No.

### Q. How to represent the time-series change of DNA sequence?

WIP


## Supplementary Queries for Constructing Graph Genome (Tentative)

GGF permits a difference file. For example, given a graph of 900 genomes and data for 100 additional genomes, a 1000-genome graph can be constructed by adding edges, segments, and paths to the 900-genome graph, without reconstructing the 1000-genome graph from scratch.

The difference file has the path to the original file and its MD5 sum in the header. The following queries can be described in the difference file.

* Add edges
* Add segments
* Split segments

The format is designed to accept these queries, but the library has not been implemented yet.

## Supplementary Formats

To support the GGF ecosystem, the supplementary formats below are introduced.

* PCF format describes genomic features such as structural variants.
* A format for lifting coordinates to describe insertion sequence. (WIP)

### Pair of genomic Coordinates Format(PCF Format)
PCF is a comma-separated format describing the features between genomic coordinates.

Structural variations can be described as a set of segments and links, and an ordered set of nodes forms a path, which is annotated with the name of a structural variation type such as inversion or deletion. This format works as a human-readable index for such structural variations.

Col | Type | Description
-----|--------|---------------------
1|string|source path name (/ segment id)
2|int|source path coordinate
3|+/-| source strand
4|string| target path name (/ segment id)
5|int| target path coordinate
6|+/-| target strand
7|int(opt.)| priority or likelihood (u8)
8|string(opt.)| sv type
9|string(opt.)| source annotation
10|string(opt.) | target annotation
-|optional| optional (color, insertion sequence)
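The table above could be read with a sketch like the one below, which assumes comma-separated fields, treats columns 7 to 10 as optional, and ignores any trailing optional columns. The class name `PcfRecord`, the function `parse_pcf_line`, and the sample line are all made up for illustration and do not come from real data.

```python
import csv
from dataclasses import dataclass
from typing import Optional


@dataclass
class PcfRecord:
    source_name: str
    source_pos: int
    source_strand: str
    target_name: str
    target_pos: int
    target_strand: str
    priority: Optional[int] = None          # column 7: priority or likelihood
    sv_type: Optional[str] = None           # column 8
    source_annotation: Optional[str] = None  # column 9
    target_annotation: Optional[str] = None  # column 10


def parse_pcf_line(line):
    """Parse one comma-separated PCF line into a PcfRecord."""
    row = next(csv.reader([line.rstrip("\n")]))
    if len(row) < 6:
        raise ValueError("a PCF line needs at least 6 columns")
    # Pad the optional columns and map empty strings to None.
    optional = [value or None for value in row[6:10]]
    optional += [None] * (4 - len(optional))
    priority = int(optional[0]) if optional[0] is not None else None
    return PcfRecord(row[0], int(row[1]), row[2], row[3], int(row[4]), row[5],
                     priority, optional[1], optional[2], optional[3])


# Hypothetical record: an inversion between two coordinates on path "chr1".
record = parse_pcf_line("chr1,1000,+,chr1,2000,-,255,inversion")
```

Keeping the optional columns as `None` when absent lets a caller distinguish a missing value from an empty annotation.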

## Implementations

The following tools are currently under development in our lab.

* SV caller
* Graph Genome Browser

The following tools may also be useful because of their compatibility with GFA.

* Bandage
* VG

## References
* https://github.com/GFA-spec/GFA-spec/blob/master/GFA2.md