@gmod/gff

read and write GFF3 data as streams

bionode, biojs, gff3, gff, genomics

Read and write GFF3 data performantly. This module aims to be a complete implementation of the GFF3 specification.

NOTE: this module uses the NPM stream package, which requires Node.js polyfills for use on the web. We also created the https://github.com/cmdcolin/gff-nostream package, a non-streaming version that does not require polyfills.

  • streaming parsing and streaming formatting
  • proper escaping and unescaping of attribute and column values
  • supports features with multiple locations and features with multiple parents
  • reconstructs feature hierarchies of both Parent and Derives_from relationships
  • parses FASTA sections
  • does no validation except for referential integrity of Parent and Derives_from relationships (can disable Derives_from reference checking with disableDerivesFromReferences)
  • only compatible with GFF3

Compatibility

Works in the browser and with Node.js v18 and up.

Install

$ npm install --save @gmod/gff

Usage

Node.js example

import {
  createReadStream,
  createWriteStream,
  readFileSync,
  writeFileSync,
} from 'fs'
import { Readable, Writable } from 'stream'
import { TransformStream } from 'stream/web'
import {
  formatSync,
  parseStringSync,
  GFFTransformer,
  GFFFormattingTransformer,
} from '@gmod/gff'

// parse a file from a file name. parses only features and sequences by default,
// set options to parse directives and/or comments
;(async () => {
  const readStream = createReadStream('/path/to/my/file.gff3')
  const streamOfGFF3 = Readable.toWeb(readStream).pipeThrough(
    new TransformStream(
      new GFFTransformer({ parseComments: true, parseDirectives: true }),
    ),
  )
  for await (const data of streamOfGFF3) {
    if ('directive' in data) {
      console.log('got a directive', data)
    } else if ('comment' in data) {
      console.log('got a comment', data)
    } else if ('sequence' in data) {
      console.log('got a sequence from a FASTA section')
    } else {
      console.log('got a feature', data)
    }
  }

  // parse a string of gff3 synchronously
  const stringOfGFF3 = readFileSync('/path/to/my/file.gff3', 'utf8')
  const arrayOfGFF3Items = parseStringSync(stringOfGFF3)

  // format an array of items to a string
  const newStringOfGFF3 = formatSync(arrayOfGFF3Items)
  writeFileSync('/path/to/new/file.gff3', newStringOfGFF3)

  // read a file, format it, and write it to a new file. inserts sync marks and
  // a '##gff-version 3' header if one is not already present
  await Readable.toWeb(createReadStream('/path/to/my/file.gff3'))
    .pipeThrough(
      new TransformStream(
        new GFFTransformer({ parseComments: true, parseDirectives: true }),
      ),
    )
    .pipeThrough(new TransformStream(new GFFFormattingTransformer()))
    .pipeTo(Writable.toWeb(createWriteStream('/path/to/new/file.gff3')))
})()

Browser example

import { GFFTransformer } from '@gmod/gff'

// parse a file from a URL. parses only features and sequences by default, set
// options to parse directives and/or comments
;(async () => {
  const response = await fetch('http://example.com/file.gff3')
  if (!response.ok) {
    throw new Error('Bad response')
  }
  if (!response.body) {
    throw new Error('No response body')
  }
  const streamOfGFF3 = response.body.pipeThrough(
    new TransformStream(
      new GFFTransformer({ parseComments: true, parseDirectives: true }),
    ),
  )
  for await (const data of streamOfGFF3) {
    if ('directive' in data) {
      console.log('got a directive', data)
    } else if ('comment' in data) {
      console.log('got a comment', data)
    } else if ('sequence' in data) {
      console.log('got a sequence from a FASTA section')
    } else {
      console.log('got a feature', data)
    }
  }
})()

Object format

features

In GFF3, features can have more than one location. We parse each feature as an array of all the lines that share that feature's ID. Values that are . in the GFF3 are null in the output.

A simple feature that's located in just one place:

[
  {
    "seq_id": "ctg123",
    "source": null,
    "type": "gene",
    "start": 1000,
    "end": 9000,
    "score": null,
    "strand": "+",
    "phase": null,
    "attributes": {
      "ID": ["gene00001"],
      "Name": ["EDEN"]
    },
    "child_features": [],
    "derived_features": []
  }
]

A CDS called cds00001 located in two places:

[
  {
    "seq_id": "ctg123",
    "source": null,
    "type": "CDS",
    "start": 1201,
    "end": 1500,
    "score": null,
    "strand": "+",
    "phase": "0",
    "attributes": {
      "ID": ["cds00001"],
      "Parent": ["mRNA00001"]
    },
    "child_features": [],
    "derived_features": []
  },
  {
    "seq_id": "ctg123",
    "source": null,
    "type": "CDS",
    "start": 3000,
    "end": 3902,
    "score": null,
    "strand": "+",
    "phase": "0",
    "attributes": {
      "ID": ["cds00001"],
      "Parent": ["mRNA00001"]
    },
    "child_features": [],
    "derived_features": []
  }
]
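
The grouping rule described in this section can be sketched as follows. This is an illustration only, not the library's parser: LineLike and groupLinesByID are hypothetical names, and the sketch assumes that a line without an ID forms its own single-line feature.

```typescript
// Hypothetical, simplified shape of a parsed feature line (not the library's type).
interface LineLike {
  type: string
  start: number
  end: number
  attributes: Record<string, string[]>
}

// Group lines that share an ID into one feature (an array of lines).
// Assumption: a line without an ID becomes its own single-line feature.
function groupLinesByID(lines: LineLike[]): LineLike[][] {
  const byID = new Map<string, LineLike[]>()
  const features: LineLike[][] = []
  for (const line of lines) {
    const ids = line.attributes['ID']
    const id = ids && ids.length > 0 ? ids[0] : undefined
    const existing = id === undefined ? undefined : byID.get(id)
    if (existing) {
      existing.push(line) // another location of a feature seen earlier
    } else {
      const feature = [line]
      features.push(feature)
      if (id !== undefined) byID.set(id, feature)
    }
  }
  return features
}

const grouped = groupLinesByID([
  { type: 'gene', start: 1000, end: 9000, attributes: { ID: ['gene00001'] } },
  { type: 'CDS', start: 1201, end: 1500, attributes: { ID: ['cds00001'] } },
  { type: 'CDS', start: 3000, end: 3902, attributes: { ID: ['cds00001'] } },
])
console.log(grouped.map(feature => feature.length)) // [ 1, 2 ]
```

The gene stays a one-line feature, while the two CDS lines end up in the same array because they share the ID cds00001.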

directives

parseDirective("##gff-version 3\n")
// returns
{
  "directive": "gff-version",
  "value": "3"
}
parseDirective('##sequence-region ctg123 1 1497228\n')
// returns
{
  "directive": "sequence-region",
  "value": "ctg123 1 1497228",
  "seq_id": "ctg123",
  "start": "1",
  "end": "1497228"
}
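
To show how a directive line maps onto the object shape above, here is a minimal sketch. DirectiveLike and parseDirectiveLine are hypothetical names and this is not the library's parseDirective; returning null for the ### sync mark is an assumption of the sketch.

```typescript
// Hypothetical result shape mirroring the examples above.
interface DirectiveLike {
  directive: string
  value: string
  seq_id?: string
  start?: string
  end?: string
}

// Split a '##name value' line into its name and value. Returns null for
// anything that does not look like a directive (including '###').
function parseDirectiveLine(line: string): DirectiveLike | null {
  const match = /^##([^\s#]\S*)\s*(.*)$/.exec(line.trim())
  if (!match) return null
  const result: DirectiveLike = { directive: match[1], value: match[2] }
  if (result.directive === 'sequence-region') {
    const [seqId, start, end] = result.value.split(/\s+/)
    result.seq_id = seqId
    result.start = start // kept as strings, as in the example above
    result.end = end
  }
  return result
}

const version = parseDirectiveLine('##gff-version 3\n')
const region = parseDirectiveLine('##sequence-region ctg123 1 1497228\n')
console.log(version, region)
```

Note that start and end stay strings here, matching the sequence-region example, whereas feature coordinates are numbers.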

comments

parseComment('# hi this is a comment\n')
// returns
{
  "comment": "hi this is a comment"
}

sequences

These come from any embedded ##FASTA section in the GFF3 file.

parseSequences(`##FASTA
>ctgA test contig
ACTGACTAGCTAGCATCAGCGTCGTAGCTATTATATTACGGTAGCCA`)
// returns
[
  {
    id: 'ctgA',
    description: 'test contig',
    sequence: 'ACTGACTAGCTAGCATCAGCGTCGTAGCTATTATATTACGGTAGCCA',
  }
]
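
As a rough illustration of how a ##FASTA section turns into sequence objects of this shape, the sketch below splits the text into records. SequenceLike and parseFastaSection are hypothetical names; this is not the library's implementation, and joining multi-line sequences without separators is an assumption based on the usual FASTA convention.

```typescript
// Hypothetical shape mirroring the sequence object shown above.
interface SequenceLike {
  id: string
  description: string
  sequence: string
}

// Read '>' header lines and concatenate the sequence lines that follow each one.
function parseFastaSection(text: string): SequenceLike[] {
  const sequences: SequenceLike[] = []
  let current: SequenceLike | undefined
  for (const rawLine of text.split(/\r?\n/)) {
    const line = rawLine.trim()
    if (line === '' || line === '##FASTA') continue
    if (line.charAt(0) === '>') {
      const match = /^>\s*(\S+)\s*(.*)$/.exec(line)
      // A header without an ID starts no record; its sequence lines are dropped.
      current = match
        ? { id: match[1], description: match[2], sequence: '' }
        : undefined
      if (current) sequences.push(current)
    } else if (current) {
      current.sequence += line
    }
  }
  return sequences
}

const fastaRecords = parseFastaSection(`##FASTA
>ctgA test contig
ACTGACTAGCTAGC
ATCAGCGTCG
>ctgB
TTTT`)
console.log(fastaRecords)
```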

API

ParseOptions

Parser options

disableDerivesFromReferences

Whether to disable reference checking of Derives_from relationships between features

Type: boolean

parseFeatures

Whether to parse features, default true

Type: boolean

parseDirectives

Whether to parse directives, default false

Type: boolean

parseComments

Whether to parse comments, default false

Type: boolean

parseSequences

Whether to parse sequences, default true

Type: boolean

bufferSize

Maximum number of GFF3 lines to buffer, default Infinity

Type: number
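
The documented defaults can be summarised in one place. The sketch below is a hypothetical, standalone model of the options (ParseOptionsLike and withDefaults are not part of the package). The value used for disableDerivesFromReferences is an assumption, since no default is stated for it above.

```typescript
// Hypothetical model of the parser options listed above.
interface ParseOptionsLike {
  disableDerivesFromReferences: boolean
  parseFeatures: boolean
  parseDirectives: boolean
  parseComments: boolean
  parseSequences: boolean
  bufferSize: number
}

const documentedDefaults: ParseOptionsLike = {
  disableDerivesFromReferences: false, // assumption: no default is documented
  parseFeatures: true,
  parseDirectives: false,
  parseComments: false,
  parseSequences: true,
  bufferSize: Infinity,
}

// Merge caller-supplied options over the defaults.
function withDefaults(options: Partial<ParseOptionsLike> = {}): ParseOptionsLike {
  return { ...documentedDefaults, ...options }
}

// Ask only for comments and directives, skipping features and sequences.
const metadataOnly = withDefaults({
  parseFeatures: false,
  parseSequences: false,
  parseComments: true,
  parseDirectives: true,
})
console.log(metadataOnly)
```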

GFFTransformer

Parse a stream of text data into a stream of feature, directive, comment, and sequence objects.

Parameters

  • options O? Parser options

parseStringSync

Synchronously parse a string containing GFF3 and return an array of the parsed items.

Parameters

  • str string GFF3 string
  • inputOptions O? Parsing options

Returns any array of parsed features, directives, comments and/or sequences

formatSync

Format an array of GFF3 items (features, directives, comments) into a string of GFF3. Does not insert synchronization (###) marks.

Parameters

  • items Array<GFF3Item> Array of features, directives, comments and/or sequences

Returns string the formatted GFF3

FormatOptions

Formatter options

minSyncLines

The minimum number of lines to emit between sync (###) directives, default 100

Type: number

insertVersionDirective

Whether to insert a version directive at the beginning of a formatted stream if one does not exist already, default true

Type: boolean

GFFFormattingTransformer

Transform a stream of features, directives, comments and/or sequences into a stream of GFF3 text.

Inserts synchronization (###) marks automatically.

Parameters

About util

There is also a util module that contains super-low-level functions for dealing with lines and parts of lines.

import { util } from '@gmod/gff'

const gff3Lines = util.formatItem({
  seq_id: 'ctgA',
  ...
})

util

unescape

Unescape a string value used in a GFF3 attribute.

Parameters

  • stringVal string Escaped GFF3 string value

Returns string An unescaped string value

escape

Escape a value for use in a GFF3 attribute value.

Parameters

Returns string An escaped string value
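
For readers unfamiliar with GFF3 escaping, the idea behind unescape and escape is percent-encoding. The sketch below is not the library's code: escapeAttributeValue and unescapeValue are hypothetical helpers, and the set of escaped characters (control characters, %, ;, =, & and ,) follows my reading of the GFF3 specification for attribute values.

```typescript
// Percent-encode characters that are reserved inside GFF3 attribute values.
function escapeAttributeValue(value: string): string {
  return value.replace(/[\x00-\x1f\x7f%;=&,]/g, ch => {
    const hex = ch.charCodeAt(0).toString(16).toUpperCase()
    return '%' + (hex.length < 2 ? '0' + hex : hex)
  })
}

// Decode every %XX sequence back to the character it encodes.
function unescapeValue(value: string): string {
  return value.replace(/%([0-9A-Fa-f]{2})/g, (_match, hex: string) =>
    String.fromCharCode(parseInt(hex, 16)),
  )
}

const rawNote = 'kinase; ATP-binding, 50% identity'
const escapedNote = escapeAttributeValue(rawNote)
console.log(escapedNote) // kinase%3B ATP-binding%2C 50%25 identity
console.log(unescapeValue(escapedNote) === rawNote) // true
```

Escaping % itself is what makes the round trip safe: a literal %3B in the input becomes %253B and decodes back to %3B rather than to a semicolon.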

escapeColumn

Escape a value for use in a GFF3 column value.

Parameters

Returns string An escaped column value

parseAttributes

Parse the 9th column (attributes) of a GFF3 feature line.

Parameters

  • attrString string String of GFF3 9th column

Returns GFF3Attributes Parsed attributes
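
A minimal sketch of what parsing the 9th column involves is shown below. It is illustrative only: parseAttributeColumn, unescapeValue and AttributesLike are hypothetical names, and returning an empty object for an empty or . column is an assumption of the sketch rather than documented library behaviour.

```typescript
type AttributesLike = Record<string, string[]>

// Decode %XX sequences (hypothetical helper).
function unescapeValue(value: string): string {
  return value.replace(/%([0-9A-Fa-f]{2})/g, (_match, hex: string) =>
    String.fromCharCode(parseInt(hex, 16)),
  )
}

// Split 'tag=value1,value2;tag2=value' into a record of string arrays.
function parseAttributeColumn(column: string): AttributesLike {
  const attributes: AttributesLike = {}
  const trimmed = column.trim()
  if (trimmed === '' || trimmed === '.') return attributes
  for (const pair of trimmed.split(';')) {
    const eq = pair.indexOf('=')
    if (eq <= 0) continue // skip empty or malformed entries
    const tag = unescapeValue(pair.slice(0, eq).trim())
    const values = pair
      .slice(eq + 1)
      .split(',')
      .map(v => unescapeValue(v.trim()))
    attributes[tag] = (attributes[tag] || []).concat(values)
  }
  return attributes
}

const parsedAttributes = parseAttributeColumn(
  'ID=cds00001;Parent=mRNA00001,mRNA00002;Note=50%25 identity',
)
console.log(parsedAttributes)
```

Splitting on ; and , before unescaping matters: an escaped %2C inside a value must not be treated as a value separator.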

parseFeature

Parse a GFF3 feature line

Parameters

  • line string GFF3 feature line

Returns GFF3FeatureLine The parsed feature
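
To illustrate the column-to-field mapping, here is a simplified sketch that handles the tab-separated columns of one line. FeatureColumns and parseFeatureColumns are hypothetical names, not the library's API. The sketch leaves the 9th column as a raw string and does not unescape column values.

```typescript
// Hypothetical shape: the first eight columns plus the raw 9th column.
interface FeatureColumns {
  seq_id: string | null
  source: string | null
  type: string | null
  start: number | null
  end: number | null
  score: number | null
  strand: string | null
  phase: string | null
  rawAttributes: string | null
}

function parseFeatureColumns(line: string): FeatureColumns {
  const columns = line.replace(/\r?\n$/, '').split('\t')
  // '.' (or a missing column) means "no value" and becomes null.
  const text = (i: number): string | null => {
    const value = columns[i]
    return value === undefined || value === '' || value === '.' ? null : value
  }
  const num = (i: number): number | null => {
    const value = text(i)
    return value === null ? null : Number(value)
  }
  return {
    seq_id: text(0),
    source: text(1),
    type: text(2),
    start: num(3),
    end: num(4),
    score: num(5),
    strand: text(6),
    phase: text(7),
    rawAttributes: text(8),
  }
}

const geneLine = parseFeatureColumns(
  'ctg123\t.\tgene\t1000\t9000\t.\t+\t.\tID=gene00001;Name=EDEN\n',
)
console.log(geneLine)
```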

parseDirective

Parse a GFF3 directive line.

Parameters

  • line string GFF3 directive line

Returns (GFF3Directive | GFF3SequenceRegionDirective | GFF3GenomeBuildDirective | null) The parsed directive

formatAttributes

Format an attributes object into a string suitable for the 9th column of GFF3.

Parameters

Returns string GFF3 9th column string
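
The reverse direction can be sketched in a few lines. This is a hypothetical illustration (formatAttributeColumn and escapeAttributeValue are not the library's functions), and emitting . for an empty attributes object is an assumption based on the convention that . marks an empty column.

```typescript
// Percent-encode characters reserved in attribute tags and values (hypothetical helper).
function escapeAttributeValue(value: string): string {
  return value.replace(/[\x00-\x1f\x7f%;=&,]/g, ch => {
    const hex = ch.charCodeAt(0).toString(16).toUpperCase()
    return '%' + (hex.length < 2 ? '0' + hex : hex)
  })
}

// Join tags and values as 'tag=value1,value2;tag2=value'.
function formatAttributeColumn(attributes: Record<string, string[]>): string {
  const parts = Object.keys(attributes).map(tag => {
    const values = attributes[tag].map(escapeAttributeValue).join(',')
    return `${escapeAttributeValue(tag)}=${values}`
  })
  return parts.length > 0 ? parts.join(';') : '.'
}

const column9 = formatAttributeColumn({
  ID: ['cds00001'],
  Parent: ['mRNA00001', 'mRNA00002'],
  Note: ['a;b'],
})
console.log(column9) // ID=cds00001;Parent=mRNA00001,mRNA00002;Note=a%3Bb
```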

formatFeature

Format a feature object or array of feature objects into one or more lines of GFF3.

Parameters

Returns string A string of one or more GFF3 lines

formatDirective

Format a directive into a line of GFF3.

Parameters

  • directive GFF3Directive A directive object

Returns string A directive line string

formatComment

Format a comment into a GFF3 comment. Yes I know this is just adding a # and a newline.

Parameters

Returns string A comment line string

formatSequence

Format a sequence object as FASTA

Parameters

Returns string Formatted single FASTA sequence string

formatItem

Format a directive, comment, sequence, or feature, or array of such items, into one or more lines of GFF3.

Parameters

Returns string A formatted string or array of strings

GFF3Attributes

A record of GFF3 attribute identifiers and the values of those identifiers

Type: Record<string, Array<string>>

GFF3FeatureLine

A representation of a single line of a GFF3 file

seq_id

The ID of the landmark used to establish the coordinate system for the current feature

Type: (string | null)

source

A free text qualifier intended to describe the algorithm or operating procedure that generated this feature

Type: (string | null)

type

The type of the feature

Type: (string | null)

start

The start coordinate of the feature

Type: (number | null)

end

The end coordinate of the feature

Type: (number | null)

score

The score of the feature

Type: (number | null)

strand

The strand of the feature

Type: (string | null)

phase

For features of type "CDS", the phase indicates where the next codon begins relative to the 5' end of the current CDS feature

Type: (string | null)

attributes

Feature attributes

Type: (GFF3Attributes | null)

GFF3FeatureLineWithRefs

Extends GFF3FeatureLine

A GFF3 Feature line that includes references to other features defined in their "Parent" or "Derives_from" attributes

child_features

An array of child features

Type: Array<GFF3Feature>

derived_features

An array of features derived from this feature

Type: Array<GFF3Feature>

GFF3Feature

A GFF3 feature, which may include multiple individual feature lines

Type: Array<GFF3FeatureLineWithRefs>
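
Because a feature is an array of lines and each line carries its own child_features, walking a hierarchy is a nested loop plus recursion. The sketch below uses pared-down hypothetical types (FeatureLineLike, FeatureLike) with only the fields it needs; outlineFeature is an illustrative helper, not part of the package.

```typescript
// Pared-down hypothetical versions of the types described above.
interface FeatureLineLike {
  type: string | null
  child_features: FeatureLike[]
  derived_features: FeatureLike[]
}
type FeatureLike = FeatureLineLike[]

// Produce an indented outline of a feature and everything beneath it.
function outlineFeature(
  feature: FeatureLike,
  depth = 0,
  out: string[] = [],
): string[] {
  for (const line of feature) {
    const label = line.type === null ? '.' : line.type
    out.push('  '.repeat(depth) + label)
    for (const child of line.child_features) {
      outlineFeature(child, depth + 1, out)
    }
  }
  return out
}

const leaf = (type: string): FeatureLike => [
  { type, child_features: [], derived_features: [] },
]
const geneFeature: FeatureLike = [
  {
    type: 'gene',
    child_features: [
      [
        {
          type: 'mRNA',
          child_features: [leaf('exon'), leaf('CDS')],
          derived_features: [],
        },
      ],
    ],
    derived_features: [],
  },
]
const outline = outlineFeature(geneFeature)
console.log(outline.join('\n'))
```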

BaseGFF3Directive

A GFF3 directive

directive

The name of the directive

Type: string

value

The string value of the directive

Type: string

GFF3SequenceRegionDirective

Extends BaseGFF3Directive

A GFF3 sequence-region directive

value

The string value of the directive

Type: string

seq_id

The sequence ID parsed from the directive

Type: string

start

The sequence start parsed from the directive

Type: string

end

The sequence end parsed from the directive

Type: string

GFF3GenomeBuildDirective

Extends BaseGFF3Directive

A GFF3 genome-build directive

value

The string value of the directive

Type: string

source

The genome build source parsed from the directive

Type: string

buildName

The genome build name parsed from the directive

Type: string

GFF3Comment

A GFF3 comment

comment

The text of the comment

Type: string

GFF3Sequence

A GFF3 FASTA single sequence

id

The ID of the sequence

Type: string

description

The description of the sequence

Type: string

sequence

The sequence

Type: string

License

MIT © Robert Buels

changelog

v2.0.0

  • Parsing and formatting of streams has been converted from Node.js streams to web streams
    • parseStream and formatStream were removed
    • GFFTransformer and GFFFormattingTransformer were added
  • parseAll and encoding options of the parser have been removed
  • bufferSize option of the parser now defaults to Infinity

v1.3.0

  • Added stream-browserify polyfill