@gmod/gff
Read and write GFF3 data performantly. This module aims to be a complete implementation of the GFF3 specification.
NOTE: this module uses the NPM stream package, which requires Node.js polyfills for use on the web. We also created the https://github.com/cmdcolin/gff-nostream package, a non-streaming version that does not require polyfills.
- streaming parsing and streaming formatting
- proper escaping and unescaping of attribute and column values
- supports features with multiple locations and features with multiple parents
- reconstructs feature hierarchies of both Parent and Derives_from relationships
- parses FASTA sections
- does no validation except for referential integrity of Parent and Derives_from relationships (can disable Derives_from reference checking with disableDerivesFromReferences)
- only compatible with GFF3
Compatibility
Works in the browser and with Node.js v18 and up.
Install
$ npm install --save @gmod/gff
Usage
Node.js example
import {
  createReadStream,
  createWriteStream,
  readFileSync,
  writeFileSync,
} from 'fs'
import { Readable, Writable } from 'stream'
import { TransformStream } from 'stream/web'
import {
  formatSync,
  parseStringSync,
  GFFTransformer,
  GFFFormattingTransformer,
} from '@gmod/gff'

// parse a file from a file name. parses only features and sequences by default,
// set options to parse directives and/or comments
;(async () => {
  const readStream = createReadStream('/path/to/my/file.gff3')
  const streamOfGFF3 = Readable.toWeb(readStream).pipeThrough(
    new TransformStream(
      new GFFTransformer({ parseComments: true, parseDirectives: true }),
    ),
  )
  for await (const data of streamOfGFF3) {
    if ('directive' in data) {
      console.log('got a directive', data)
    } else if ('comment' in data) {
      console.log('got a comment', data)
    } else if ('sequence' in data) {
      console.log('got a sequence from a FASTA section')
    } else {
      console.log('got a feature', data)
    }
  }

  // parse a string of gff3 synchronously
  const stringOfGFF3 = readFileSync('/path/to/my/file.gff3', 'utf8')
  const arrayOfGFF3Items = parseStringSync(stringOfGFF3)

  // format an array of items to a string
  const newStringOfGFF3 = formatSync(arrayOfGFF3Items)
  writeFileSync('/path/to/new/file.gff3', newStringOfGFF3)

  // read a file, format it, and write it to a new file. inserts sync marks and
  // a '##gff-version 3' header if one is not already present
  await Readable.toWeb(createReadStream('/path/to/my/file.gff3'))
    .pipeThrough(
      new TransformStream(
        new GFFTransformer({ parseComments: true, parseDirectives: true }),
      ),
    )
    .pipeThrough(new TransformStream(new GFFFormattingTransformer()))
    .pipeTo(Writable.toWeb(createWriteStream('/path/to/new/file.gff3')))
})()
Browser example
import { GFFTransformer } from '@gmod/gff'

// parse a file from a URL. parses only features and sequences by default, set
// options to parse directives and/or comments
;(async () => {
  const response = await fetch('http://example.com/file.gff3')
  if (!response.ok) {
    throw new Error('Bad response')
  }
  if (!response.body) {
    throw new Error('No response body')
  }
  const streamOfGFF3 = response.body.pipeThrough(
    new TransformStream(
      new GFFTransformer({ parseComments: true, parseDirectives: true }),
    ),
  )
  // each item is a directive, comment, sequence, or feature
  for await (const data of streamOfGFF3) {
    if ('directive' in data) {
      console.log('got a directive', data)
    } else if ('comment' in data) {
      console.log('got a comment', data)
    } else if ('sequence' in data) {
      console.log('got a sequence from a FASTA section')
    } else {
      console.log('got a feature', data)
    }
  }
})()
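The `in` checks used in both examples can be factored into a small helper. The following is a minimal sketch that runs on hand-written sample data rather than on real parser output; `classifyItem` and `countByKind` are hypothetical names and are not part of @gmod/gff.

```javascript
// Hypothetical helper (not part of @gmod/gff): classify one parsed item
// using the same `in` checks as the usage examples.
function classifyItem(item) {
  if ('directive' in item) return 'directive'
  if ('comment' in item) return 'comment'
  if ('sequence' in item) return 'sequence'
  // Anything else is treated as a feature, i.e. an array of feature lines.
  return 'feature'
}

// Count how many items of each kind appear in an array of parsed items.
function countByKind(items) {
  const counts = { directive: 0, comment: 0, sequence: 0, feature: 0 }
  for (const item of items) {
    counts[classifyItem(item)] += 1
  }
  return counts
}

// Hand-written sample data in the shapes described under "Object format".
const sampleItems = [
  { directive: 'gff-version', value: '3' },
  { comment: 'hi this is a comment' },
  [
    {
      seq_id: 'ctg123',
      type: 'gene',
      start: 1000,
      end: 9000,
      attributes: { ID: ['gene00001'] },
      child_features: [],
      derived_features: [],
    },
  ],
  { id: 'ctgA', description: 'test contig', sequence: 'ACTG' },
]

console.log(countByKind(sampleItems))
// prints { directive: 1, comment: 1, sequence: 1, feature: 1 }
```

The order of the checks matters only in that features come last: a feature is an array, so none of the three `in` checks match it.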
Object format
features
In GFF3, features can have more than one location. We parse features as arrays of all the lines that share that feature's ID.
Values that are . in the GFF3 are null in the output.
A simple feature that's located in just one place:
[
{
"seq_id": "ctg123",
"source": null,
"type": "gene",
"start": 1000,
"end": 9000,
"score": null,
"strand": "+",
"phase": null,
"attributes": {
"ID": ["gene00001"],
"Name": ["EDEN"]
},
"child_features": [],
"derived_features": []
}
]
A CDS called cds00001 located in two places:
[
{
"seq_id": "ctg123",
"source": null,
"type": "CDS",
"start": 1201,
"end": 1500,
"score": null,
"strand": "+",
"phase": "0",
"attributes": {
"ID": ["cds00001"],
"Parent": ["mRNA00001"]
},
"child_features": [],
"derived_features": []
},
{
"seq_id": "ctg123",
"source": null,
"type": "CDS",
"start": 3000,
"end": 3902,
"score": null,
"strand": "+",
"phase": "0",
"attributes": {
"ID": ["cds00001"],
"Parent": ["mRNA00001"]
},
"child_features": [],
"derived_features": []
}
]
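Because a feature is an array of lines, code that wants a single extent for a multi-location feature has to combine the lines itself. A minimal sketch, assuming all lines of the feature lie on the same seq_id; `featureExtent` is a hypothetical helper, not a library function.

```javascript
// Hypothetical helper (not part of @gmod/gff): compute the overall extent
// of a feature from its array of lines. Assumes every line shares the
// seq_id of the first line.
function featureExtent(feature) {
  // start and end can be null when the GFF3 column was '.', so skip those.
  const starts = feature.map(line => line.start).filter(v => v !== null)
  const ends = feature.map(line => line.end).filter(v => v !== null)
  if (starts.length === 0 || ends.length === 0) {
    return null
  }
  return {
    seq_id: feature[0].seq_id,
    start: Math.min(...starts),
    end: Math.max(...ends),
  }
}

// The two-location CDS from the example above, reduced to the fields used here.
const cds00001 = [
  { seq_id: 'ctg123', type: 'CDS', start: 1201, end: 1500 },
  { seq_id: 'ctg123', type: 'CDS', start: 3000, end: 3902 },
]

console.log(featureExtent(cds00001))
// prints { seq_id: 'ctg123', start: 1201, end: 3902 }
```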
directives
parseDirective("##gff-version 3\n")
// returns
{
"directive": "gff-version",
"value": "3"
}
parseDirective('##sequence-region ctg123 1 1497228\n')
// returns
{
"directive": "sequence-region",
"value": "ctg123 1 1497228",
"seq_id": "ctg123",
"start": "1",
"end": "1497228"
}
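Note that start and end on a sequence-region directive are strings, unlike the numeric start and end on feature lines. A minimal sketch of converting them, using a hypothetical helper `sequenceRegionToNumbers` that is not part of the library:

```javascript
// Hypothetical helper (not part of @gmod/gff): convert the string
// coordinates of a parsed sequence-region directive to numbers.
function sequenceRegionToNumbers(directive) {
  if (directive.directive !== 'sequence-region') {
    return null
  }
  const start = Number.parseInt(directive.start, 10)
  const end = Number.parseInt(directive.end, 10)
  // GFF3 coordinates are 1-based and inclusive, hence the + 1.
  return { seq_id: directive.seq_id, start, end, length: end - start + 1 }
}

// The parsed directive from the example above.
const region = sequenceRegionToNumbers({
  directive: 'sequence-region',
  value: 'ctg123 1 1497228',
  seq_id: 'ctg123',
  start: '1',
  end: '1497228',
})

console.log(region)
// prints { seq_id: 'ctg123', start: 1, end: 1497228, length: 1497228 }
```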
comments
parseComment('# hi this is a comment\n')
// returns
{
"comment": "hi this is a comment"
}
sequences
These come from any embedded ##FASTA section in the GFF3 file.
parseSequences(`##FASTA
>ctgA test contig
ACTGACTAGCTAGCATCAGCGTCGTAGCTATTATATTACGGTAGCCA`)
// returns
[
  {
    id: 'ctgA',
    description: 'test contig',
    sequence: 'ACTGACTAGCTAGCATCAGCGTCGTAGCTATTATATTACGGTAGCCA',
  },
]
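Sequence objects can be indexed by id so that feature coordinates can be looked up against them. The sketch below is one possible approach on hand-written data; `indexSequences` and `subsequence` are hypothetical helpers, not library functions.

```javascript
// Hypothetical helper (not part of @gmod/gff): build a Map from sequence id
// to sequence object, ignoring features, directives, and comments.
function indexSequences(items) {
  const byId = new Map()
  for (const item of items) {
    if (!Array.isArray(item) && 'sequence' in item) {
      byId.set(item.id, item)
    }
  }
  return byId
}

// Hypothetical helper: extract bases using GFF3-style 1-based, inclusive
// coordinates.
function subsequence(seqRecord, start, end) {
  return seqRecord.sequence.slice(start - 1, end)
}

const parsedItems = [
  { comment: 'hi this is a comment' },
  {
    id: 'ctgA',
    description: 'test contig',
    sequence: 'ACTGACTAGCTAGCATCAGCGTCGTAGCTATTATATTACGGTAGCCA',
  },
]

const sequencesById = indexSequences(parsedItems)
console.log(subsequence(sequencesById.get('ctgA'), 1, 4))
// prints ACTG
```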
API
ParseOptions
Parser options
disableDerivesFromReferences
Whether to disable resolution of Derives_from references between features
Type: boolean
parseFeatures
Whether to parse features, default true
Type: boolean
parseDirectives
Whether to parse directives, default false
Type: boolean
parseComments
Whether to parse comments, default false
Type: boolean
parseSequences
Whether to parse sequences, default true
Type: boolean
bufferSize
Maximum number of GFF3 lines to buffer, default Infinity
Type: number
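To make the defaults easier to see at a glance, the sketch below collects the documented values into one object and overlays caller overrides on them. It does not call the library; `documentedDefaults` and `effectiveOptions` are hypothetical names, and the resulting object is the kind of value passed to `new GFFTransformer(...)` in the usage examples.

```javascript
// Defaults as documented above. disableDerivesFromReferences is left out
// because its default is not stated in this section.
const documentedDefaults = {
  parseFeatures: true,
  parseDirectives: false,
  parseComments: false,
  parseSequences: true,
  bufferSize: Infinity,
}

// Hypothetical helper (not part of @gmod/gff): overlay caller overrides on
// the documented defaults to see the effective settings.
function effectiveOptions(overrides = {}) {
  return { ...documentedDefaults, ...overrides }
}

// Example: directives and comments only, with a bounded line buffer.
const headerOnlyOptions = effectiveOptions({
  parseFeatures: false,
  parseSequences: false,
  parseDirectives: true,
  parseComments: true,
  bufferSize: 1000,
})

console.log(headerOnlyOptions)
```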
GFFTransformer
Parse a stream of text data into a stream of feature, directive, comment, and sequence objects.
Parameters
options
O? Parser options
parseStringSync
Synchronously parse a string containing GFF3 and return an array of the parsed items.
Parameters
str
string GFF3 string
inputOptions
O? Parsing options
Returns any array of parsed features, directives, comments and/or sequences
formatSync
Format an array of GFF3 items (features, directives, comments) into a string of GFF3. Does not insert synchronization (###) marks.
Parameters
items
Array<GFF3Item> Array of features, directives, comments and/or sequences
Returns string the formatted GFF3
FormatOptions
Formatter options
minSyncLines
The minimum number of lines to emit between sync (###) directives, default 100
Type: number
insertVersionDirective
Whether to insert a version directive at the beginning of a formatted stream if one does not exist already, default true
Type: boolean
GFFFormattingTransformer
Transform a stream of features, directives, comments and/or sequences into a stream of GFF3 text.
Inserts synchronization (###) marks automatically.
Parameters
options
FormatOptions Formatter options (optional, default {})
About util
There is also a util module that contains super-low-level functions for dealing with lines and parts of lines.
import { util } from '@gmod/gff'
const gff3Lines = util.formatItem({
seq_id: 'ctgA',
...
})
util
Table of Contents
- unescape
- escape
- escapeColumn
- parseAttributes
- parseFeature
- parseDirective
- formatAttributes
- formatFeature
- formatDirective
- formatComment
- formatSequence
- formatItem
- GFF3Attributes
- GFF3FeatureLine
- GFF3FeatureLineWithRefs
- GFF3Feature
- BaseGFF3Directive
- GFF3SequenceRegionDirective
- GFF3GenomeBuildDirective
- GFF3Comment
- GFF3Sequence
unescape
Unescape a string value used in a GFF3 attribute.
Parameters
stringVal
string Escaped GFF3 string value
Returns string An unescaped string value
escape
Escape a value for use in a GFF3 attribute value.
Parameters
Returns string An escaped string value
escapeColumn
Escape a value for use in a GFF3 column value.
Parameters
Returns string An escaped column value
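For readers unfamiliar with GFF3 escaping, the sketch below illustrates the percent-encoding the GFF3 specification describes for attribute values: control characters, `%`, and the attribute delimiters `;`, `=`, `&` and `,` become `%XX`. It is an independent illustration, not the library's implementation, and `sketchEscapeAttribute` and `sketchUnescape` are hypothetical names; the exact character set handled by the real escape, escapeColumn and unescape functions may differ.

```javascript
// Illustration only (not the @gmod/gff implementation): percent-encode
// control characters, '%', and the attribute delimiters ; = & ,
function sketchEscapeAttribute(value) {
  return String(value).replace(
    /[\x00-\x1f\x7f%;=&,]/g,
    ch => '%' + ch.charCodeAt(0).toString(16).toUpperCase().padStart(2, '0'),
  )
}

// Illustration only: decode every %XX sequence back to its character.
function sketchUnescape(value) {
  return String(value).replace(/%([0-9A-Fa-f]{2})/g, (match, hex) =>
    String.fromCharCode(Number.parseInt(hex, 16)),
  )
}

const rawValue = 'a=b;c,d 50%'
const escapedValue = sketchEscapeAttribute(rawValue)
console.log(escapedValue)
// prints a%3Db%3Bc%2Cd 50%25
console.log(sketchUnescape(escapedValue) === rawValue)
// prints true
```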
parseAttributes
Parse the 9th column (attributes) of a GFF3 feature line.
Parameters
attrString
string String of GFF3 9th column
Returns GFF3Attributes Parsed attributes
parseFeature
Parse a GFF3 feature line
Parameters
line
string GFF3 feature line
Returns GFF3FeatureLine The parsed feature
parseDirective
Parse a GFF3 directive line.
Parameters
line
string GFF3 directive line
Returns (GFF3Directive | GFF3SequenceRegionDirective | GFF3GenomeBuildDirective | null) The parsed directive
formatAttributes
Format an attributes object into a string suitable for the 9th column of GFF3.
Parameters
attrs
GFF3Attributes Attributes
Returns string GFF3 9th column string
formatFeature
Format a feature object or array of feature objects into one or more lines of GFF3.
Parameters
featureOrFeatures
(GFF3FeatureLine | GFF3FeatureLineWithRefs | Array<(GFF3FeatureLine | GFF3FeatureLineWithRefs)>) A feature object or array of feature objects
Returns string A string of one or more GFF3 lines
formatDirective
Format a directive into a line of GFF3.
Parameters
directive
GFF3Directive A directive object
Returns string A directive line string
formatComment
Format a comment into a GFF3 comment. Yes I know this is just adding a # and a newline.
Parameters
comment
GFF3Comment A comment object
Returns string A comment line string
formatSequence
Format a sequence object as FASTA
Parameters
seq
GFF3Sequence A sequence object
Returns string Formatted single FASTA sequence string
formatItem
Format a directive, comment, sequence, or feature, or array of such items, into one or more lines of GFF3.
Parameters
item
(GFF3FeatureLineWithRefs | GFF3Directive | GFF3Comment | GFF3Sequence)
itemOrItems
A comment, sequence, or feature, or array of such items
Returns string A formatted string or array of strings
GFF3Attributes
A record of GFF3 attribute identifiers and the values of those identifiers
Type: Record<string, Array<string>>
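Since every attribute value is an array, even for single-valued attributes such as ID, a small accessor can be convenient. A minimal sketch, with `firstAttribute` as a hypothetical helper that is not part of the library:

```javascript
// Hypothetical helper (not part of @gmod/gff): return the first value of an
// attribute, or undefined when the attribute (or the whole attributes
// object) is missing.
function firstAttribute(attributes, key) {
  if (!attributes) {
    return undefined // a feature line's attributes can be null
  }
  if (!Object.prototype.hasOwnProperty.call(attributes, key)) {
    return undefined
  }
  const values = attributes[key]
  return values.length > 0 ? values[0] : undefined
}

const geneAttributes = { ID: ['gene00001'], Name: ['EDEN'] }

console.log(firstAttribute(geneAttributes, 'Name'))
// prints EDEN
console.log(firstAttribute(geneAttributes, 'Parent'))
// prints undefined
```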
GFF3FeatureLine
A representation of a single line of a GFF3 file
seq_id
The ID of the landmark used to establish the coordinate system for the current feature
Type: (string | null)
source
A free text qualifier intended to describe the algorithm or operating procedure that generated this feature
Type: (string | null)
type
The type of the feature
Type: (string | null)
start
The start coordinates of the feature
Type: (number | null)
end
The end coordinates of the feature
Type: (number | null)
score
The score of the feature
Type: (number | null)
strand
The strand of the feature
Type: (string | null)
phase
For features of type "CDS", the phase indicates where the next codon begins relative to the 5' end of the current CDS feature
Type: (string | null)
attributes
Feature attributes
Type: (GFF3Attributes | null)
GFF3FeatureLineWithRefs
Extends GFF3FeatureLine
A GFF3 Feature line that includes references to other features defined in their "Parent" or "Derives_from" attributes
child_features
An array of child features
Type: Array<GFF3Feature>
derived_features
An array of features derived from this feature
Type: Array<GFF3Feature>
GFF3Feature
A GFF3 feature, which may include multiple individual feature lines
Type: Array<GFF3FeatureLineWithRefs>
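The nesting of these types (a feature is an array of lines, and each line carries arrays of child features) lends itself to a recursive walk. The sketch below builds an indented outline from hand-written data; `outline` is a hypothetical helper, and the `seen` set is a precaution in case the same child array is reachable from more than one line. derived_features could be walked the same way.

```javascript
// Hypothetical helper (not part of @gmod/gff): build an indented outline of
// a feature and its child features.
function outline(feature, depth = 0, seen = new Set(), out = []) {
  if (seen.has(feature)) {
    return out // already visited via another line or parent
  }
  seen.add(feature)
  for (const line of feature) {
    const ids = (line.attributes && line.attributes.ID) || []
    out.push(`${'  '.repeat(depth)}${line.type} ${ids[0] ?? '(no ID)'}`)
    for (const child of line.child_features) {
      outline(child, depth + 1, seen, out)
    }
  }
  return out
}

// Build a feature line with only the fields this sketch needs.
const makeLine = (type, id, children = []) => ({
  type,
  attributes: { ID: [id] },
  child_features: children,
  derived_features: [],
})

// Hand-written hierarchy: gene -> mRNA -> two exons.
const exon1 = [makeLine('exon', 'exon00001')]
const exon2 = [makeLine('exon', 'exon00002')]
const mrna = [makeLine('mRNA', 'mRNA00001', [exon1, exon2])]
const gene = [makeLine('gene', 'gene00001', [mrna])]

console.log(outline(gene).join('\n'))
```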
BaseGFF3Directive
A GFF3 directive
directive
The name of the directive
Type: string
value
The string value of the directive
Type: string
GFF3SequenceRegionDirective
Extends BaseGFF3Directive
A GFF3 sequence-region directive
value
The string value of the directive
Type: string
seq_id
The sequence ID parsed from the directive
Type: string
start
The sequence start parsed from the directive
Type: string
end
The sequence end parsed from the directive
Type: string
GFF3GenomeBuildDirective
Extends BaseGFF3Directive
A GFF3 genome-build directive
value
The string value of the directive
Type: string
source
The genome build source parsed from the directive
Type: string
buildName
The genome build name parsed from the directive
Type: string
GFF3Comment
A GFF3 comment
comment
The text of the comment
Type: string
GFF3Sequence
A GFF3 FASTA single sequence
id
The ID of the sequence
Type: string
description
The description of the sequence
Type: string
sequence
The sequence
Type: string
License
MIT © Robert Buels