@gmod/gff

read and write GFF3 data as streams

Read and write GFF3 data performantly. This module aims to be a complete implementation of the GFF3 specification.
  • streaming parsing and streaming formatting
  • proper escaping and unescaping of attribute and column values
  • supports features with multiple locations and features with multiple parents
  • reconstructs feature hierarchies of both Parent and Derives_from relationships
  • parses FASTA sections
  • does no validation except for referential integrity of Parent and Derives_from relationships
  • only compatible with GFF3

Install

$ npm install --save @gmod/gff

Usage

const gff = require('@gmod/gff').default
// or in ES6 (recommended)
import gff from '@gmod/gff'

const fs = require('fs')

// parse a file from a file name
// parses only features and sequences by default,
// set options to parse directives and/or comments
fs.createReadStream('path/to/my/file.gff3')
  .pipe(gff.parseStream({ parseAll: true }))
  .on('data', (data) => {
    if (data.directive) {
      console.log('got a directive', data)
    } else if (data.comment) {
      console.log('got a comment', data)
    } else if (data.sequence) {
      console.log('got a sequence from a FASTA section')
    } else {
      console.log('got a feature', data)
    }
  })

// parse a string of gff3 synchronously
const stringOfGFF3 = fs.readFileSync('my_annotations.gff3').toString()
const arrayOfThings = gff.parseStringSync(stringOfGFF3)

// format an array of items to a string
const newStringOfGFF3 = gff.formatSync(arrayOfThings)

// format a stream of things to a stream of text.
// inserts sync marks automatically.
myStreamOfGFF3Objects
  .pipe(gff.formatStream())
  .pipe(fs.createWriteStream('my_new.gff3'))

// format a stream of things and write it to
// a gff3 file. inserts sync marks and a
// '##gff-version 3' header if one is not
// already present
gff.formatFile(
  myStreamOfGFF3Objects,
  fs.createWriteStream('my_new_2.gff3', { encoding: 'utf8' }),
)

Object format

features

In GFF3, features can have more than one location. We parse each feature as an array of all the lines that share that feature's ID. Values that are . in the GFF3 are null in the output.
A simple feature that's located in just one place:
[
  {
    "seq_id": "ctg123",
    "source": null,
    "type": "gene",
    "start": 1000,
    "end": 9000,
    "score": null,
    "strand": "+",
    "phase": null,
    "attributes": {
      "ID": [
        "gene00001"
      ],
      "Name": [
        "EDEN"
      ]
    },
    "child_features": [],
    "derived_features": []
  }
]

A CDS called cds00001 located in two places:
[
  {
    "seq_id": "ctg123",
    "source": null,
    "type": "CDS",
    "start": 1201,
    "end": 1500,
    "score": null,
    "strand": "+",
    "phase": "0",
    "attributes": {
      "ID": ["cds00001"],
      "Parent": ["mRNA00001"]
    },
    "child_features": [],
    "derived_features": []
  },
  {
    "seq_id": "ctg123",
    "source": null,
    "type": "CDS",
    "start": 3000,
    "end": 3902,
    "score": null,
    "strand": "+",
    "phase": "0",
    "attributes": {
      "ID": ["cds00001"],
      "Parent": ["mRNA00001"]
    },
    "child_features": [],
    "derived_features": []
  }
]
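
Because every feature is an array of location lines, code that consumes parsed output generally has to look at all of them. The following is a minimal sketch that assumes only the object shape shown above; `featureSpan` is a hypothetical helper for illustration and is not part of @gmod/gff.

```javascript
// Hypothetical helper (not part of @gmod/gff): compute the overall extent
// of a parsed feature, which is an array of one or more location lines.
function featureSpan(feature) {
  const starts = feature.map((line) => line.start).filter((v) => v !== null)
  const ends = feature.map((line) => line.end).filter((v) => v !== null)
  if (starts.length === 0 || ends.length === 0) return null
  return { start: Math.min(...starts), end: Math.max(...ends) }
}

// The two-location CDS from the example above, trimmed to the relevant fields
const cds = [
  { seq_id: 'ctg123', type: 'CDS', start: 1201, end: 1500 },
  { seq_id: 'ctg123', type: 'CDS', start: 3000, end: 3902 },
]

console.log(featureSpan(cds)) // { start: 1201, end: 3902 }
```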

directives

parseDirective("##gff-version 3\n")
// returns
{
  "directive": "gff-version",
  "value": "3"
}

parseDirective('##sequence-region ctg123 1 1497228\n')
// returns
{
  "directive": "sequence-region",
  "value": "ctg123 1 1497228",
  "seq_id": "ctg123",
  "start": "1",
  "end": "1497228"
}
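
Note that start and end come back as strings in the output above. As a rough illustration of how a directive line maps onto this shape, here is a sketch that assumes whitespace-separated fields; `sketchParseDirective` is a hypothetical stand-in, and the real util.parseDirective may behave differently.

```javascript
// Hypothetical stand-in for illustration; not the library's implementation.
function sketchParseDirective(line) {
  // two leading '#' characters (but not a '###' sync mark), a name, then the rest
  const match = /^##(?!#)\s*(\S+)\s*(.*)/.exec(line)
  if (!match) return null
  const [, name, rest] = match
  const value = rest.trim()
  const parsed = { directive: name }
  if (value.length > 0) parsed.value = value
  if (name === 'sequence-region') {
    const [seq_id, start, end] = value.split(/\s+/)
    // start and end stay strings, matching the output shown above
    Object.assign(parsed, { seq_id, start, end })
  }
  return parsed
}

console.log(sketchParseDirective('##sequence-region ctg123 1 1497228\n'))
```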

comments

parseComment('# hi this is a comment\n')
// returns
{
  "comment": "hi this is a comment"
}

sequences

These come from any embedded ##FASTA section in the GFF3 file.
parseSequences(`##FASTA
>ctgA test contig
ACTGACTAGCTAGCATCAGCGTCGTAGCTATTATATTACGGTAGCCA`)
// returns
[
  {
    "id": "ctgA",
    "description": "test contig",
    "sequence": "ACTGACTAGCTAGCATCAGCGTCGTAGCTATTATATTACGGTAGCCA"
  }
]
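
To make the mapping from FASTA text to these objects concrete, here is a small sketch. It assumes the ID is everything up to the first whitespace in the header line and that sequence lines may be wrapped; `sketchParseFasta` is a hypothetical name and is not how the library itself is implemented.

```javascript
// Hypothetical sketch: split FASTA text into { id, description, sequence } records.
function sketchParseFasta(text) {
  const records = []
  let current = null
  for (const rawLine of text.split(/\r?\n/)) {
    const line = rawLine.trim()
    if (!line || line.startsWith('##FASTA')) continue
    if (line.startsWith('>')) {
      // header: the ID runs to the first whitespace, the rest is the description
      const header = /^>\s*(\S+)\s*(.*)$/.exec(line)
      current = header
        ? { id: header[1], description: header[2], sequence: '' }
        : null
      if (current) records.push(current)
    } else if (current) {
      // sequence text may be wrapped across several lines
      current.sequence += line
    }
  }
  return records
}

console.log(sketchParseFasta('##FASTA\n>ctgA test contig\nACTGACTAGC\nTAGCATCAGC\n'))
```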

API

Table of Contents

  • ParseOptions
      • encoding
      • parseFeatures
      • parseDirectives
      • parseComments
      • parseSequences
      • parseAll
      • bufferSize
  • parseStream
  • parseStringSync
  • formatSync
  • formatStream
  • formatFile

ParseOptions

Parser options

encoding

Text encoding of the input GFF3. default 'utf8'
Type: BufferEncoding

parseFeatures

Whether to parse features, default true
Type: boolean

parseDirectives

Whether to parse directives, default false
Type: boolean

parseComments

Whether to parse comments, default false
Type: boolean

parseSequences

Whether to parse sequences, default true
Type: boolean

parseAll

Parse all features, directives, comments, and sequences. Overrides other parsing options. Default false.
Type: boolean

bufferSize

Maximum number of GFF3 lines to buffer, default 1000
Type: number
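
The documented defaults, and the way parseAll overrides the individual switches, can be summarized in a few lines. This is only a sketch of the behavior described above; `documentedDefaults` and `resolveOptions` are hypothetical names, and the library's internal option handling may differ.

```javascript
// Hypothetical sketch of how the documented defaults and parseAll interact.
const documentedDefaults = {
  encoding: 'utf8',
  parseFeatures: true,
  parseDirectives: false,
  parseComments: false,
  parseSequences: true,
  parseAll: false,
  bufferSize: 1000,
}

function resolveOptions(userOptions = {}) {
  const options = { ...documentedDefaults, ...userOptions }
  if (options.parseAll) {
    // parseAll overrides the four individual parse* switches
    options.parseFeatures = true
    options.parseDirectives = true
    options.parseComments = true
    options.parseSequences = true
  }
  return options
}

console.log(resolveOptions({ parseAll: true, parseComments: false }).parseComments) // true
```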

parseStream

Parse a stream of text data into a stream of feature, directive, comment, and sequence objects.

Parameters


Returns GFFTransform stream (in objectMode) of parsed items

parseStringSync

Synchronously parse a string containing GFF3 and return an array of the parsed items.

Parameters

  • str string GFF3 string
  • inputOptions ({encoding: BufferEncoding?, bufferSize: number?} | undefined)? Parsing options

Returns Array<(GFF3Feature | GFF3Sequence)> array of parsed features, directives, comments and/or sequences

formatSync

Format an array of GFF3 items (features, directives, comments) into a string of GFF3. Does not insert synchronization (###) marks.

Parameters

  • items Array Array of features, directives, comments and/or sequences

Returns
string the formatted GFF3

formatStream

Format a stream of features, directives, comments and/or sequences into a stream of GFF3 text.
Inserts synchronization (###) marks automatically.

Parameters

  • options FormatOptions parser options (optional, default {})

Returns
FormattingTransform

formatFile

Format a stream of features, directives, comments and/or sequences into a GFF3 file and write it to the filesystem.
Inserts synchronization (###) marks and a ##gff-version directive automatically (if one is not already present).

Parameters

  • stream Readable the stream to write to the file
  • writeStream Writable
  • options FormatOptions parser options (optional, default {})
  • filename the file path to write to

Returns
Promise<null> promise for null that resolves when the stream has been written

About util

There is also a util module that contains super-low-level functions for dealing with lines and parts of lines.
// non-ES6
const util = require('@gmod/gff').default.util
// or, with ES6
import gff from '@gmod/gff'
const util = gff.util

const gff3Lines = util.formatItem({
  seq_id: 'ctgA',
  ...
})

util

Table of Contents

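  • unescape
  • escape
  • escapeColumn
  • parseAttributes
  • parseFeature
  • parseDirective
  • formatAttributes
  • formatFeature
  • formatDirective
  • formatComment
  • formatSequence
  • formatItem
  • GFF3Attributes
  • GFF3FeatureLine
      • seq_id
      • source
      • type
      • start
      • end
      • score
      • strand
      • phase
      • attributes
  • GFF3FeatureLineWithRefs
      • child_features
      • derived_features
  • GFF3Feature
  • GFF3Directive
      • directive
      • value
  • GFF3SequenceRegionDirective
      • value
      • seq_id
      • start
      • end
  • GFF3GenomeBuildDirective
      • value
      • source
      • buildName
  • GFF3Comment
      • comment
  • GFF3Sequence
      • id
      • description
      • sequence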

unescape

Unescape a string value used in a GFF3 attribute.

Parameters

  • stringVal string Escaped GFF3 string value

Returns string An unescaped string value
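
GFF3 uses URL-style percent escapes (%XX, two hexadecimal digits). A minimal sketch of the decoding step, assuming single-byte escapes only; `sketchUnescape` is a hypothetical name and is not the library's own code.

```javascript
// Hypothetical sketch: decode %XX escapes as used in GFF3 values.
function sketchUnescape(stringVal) {
  // fast path: nothing to decode
  if (!stringVal.includes('%')) return stringVal
  return stringVal.replace(/%([0-9A-Fa-f]{2})/g, (_match, hex) =>
    String.fromCharCode(parseInt(hex, 16)),
  )
}

console.log(sketchUnescape('a%3Bb%3Dc')) // a;b=c
```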

escape

Escape a value for use in a GFF3 attribute value.

Parameters


Returns
string An escaped string value
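
In the attributes column the characters ; = & and , have structural meaning, so they need escaping in addition to %, tab, newline and other control characters. The sketch below shows one way to do that under those assumptions; `sketchEscape` is a hypothetical name, and the exact character set the library escapes may differ.

```javascript
// Hypothetical sketch: percent-escape characters that are unsafe in an
// attribute value (control characters, %, and the separators ; = & ,).
function sketchEscape(rawVal) {
  return String(rawVal).replace(/[\x00-\x1f\x7f%;=&,]/g, (ch) => {
    const hex = ch.charCodeAt(0).toString(16).toUpperCase().padStart(2, '0')
    return `%${hex}`
  })
}

console.log(sketchEscape('50%, a=b;c')) // 50%25%2C a%3Db%3Bc
```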

escapeColumn

Escape a value for use in a GFF3 column value.

Parameters


Returns
string An escaped column value
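
Outside the attributes column, ; = & and , carry no special meaning, so a column value only needs %, tab, newline and other control characters escaped. A sketch under that assumption follows; `sketchEscapeColumn` is a hypothetical name, and the GFF3 specification places further restrictions on the seqid column that this sketch ignores.

```javascript
// Hypothetical sketch: percent-escape only control characters and '%'
// for use in one of the first eight tab-separated columns.
function sketchEscapeColumn(rawVal) {
  return String(rawVal).replace(/[\x00-\x1f\x7f%]/g, (ch) => {
    const hex = ch.charCodeAt(0).toString(16).toUpperCase().padStart(2, '0')
    return `%${hex}`
  })
}

console.log(sketchEscapeColumn('a;b\tc')) // a;b%09c
```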

parseAttributes

Parse the 9th column (attributes) of a GFF3 feature line.

Parameters

  • attrString string String of GFF3 9th column

Returns
GFF3Attributes Parsed attributes
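
The 9th column is a semicolon-separated list of tag=value pairs, where a value may itself be a comma-separated list. Here is a rough sketch of that parsing, producing the array-valued shape shown under "Object format"; `sketchParseAttributes` is a hypothetical name and not the library's implementation.

```javascript
// Hypothetical sketch (not the library's implementation) of parsing column 9.
function sketchParseAttributes(attrString) {
  const decode = (s) =>
    s.replace(/%([0-9A-Fa-f]{2})/g, (_m, hex) => String.fromCharCode(parseInt(hex, 16)))
  const parsed = new Map()
  const trimmed = (attrString || '').replace(/\r?\n$/, '')
  if (trimmed === '' || trimmed === '.') return {}
  for (const pair of trimmed.split(';')) {
    const eq = pair.indexOf('=')
    if (eq === -1) continue // skip empty or malformed pairs
    const tag = decode(pair.slice(0, eq).trim())
    if (!tag) continue
    // split on commas first, then decode, so an escaped %2C stays inside one value
    const values = pair.slice(eq + 1).split(',').map((v) => decode(v.trim()))
    parsed.set(tag, (parsed.get(tag) || []).concat(values))
  }
  return Object.fromEntries(parsed)
}

// the escaped comma stays inside the single Note value 'a,b'
console.log(sketchParseAttributes('ID=cds00001;Parent=mRNA00001,mRNA00002;Note=a%2Cb'))
```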

parseFeature

Parse a GFF3 feature line

Parameters


Returns
GFF3FeatureLine The parsed feature
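
A GFF3 feature line has nine tab-separated columns. As an illustration of how they map onto the fields of a feature line object, here is a sketch; `sketchParseFeatureLine` is a hypothetical name, and it leaves out unescaping and attribute parsing, which the library performs.

```javascript
// Hypothetical sketch of splitting one feature line into named columns.
// Unescaping and attribute parsing are left out to keep it short.
function sketchParseFeatureLine(line) {
  const raw = line.replace(/\r?\n$/, '').split('\t')
  // a '.' (or a missing column) means "no value" and becomes null
  const col = (i) => (raw[i] === undefined || raw[i] === '.' ? null : raw[i])
  const num = (i, parse) => (col(i) === null ? null : parse(col(i)))
  return {
    seq_id: col(0),
    source: col(1),
    type: col(2),
    start: num(3, (v) => parseInt(v, 10)),
    end: num(4, (v) => parseInt(v, 10)),
    score: num(5, (v) => parseFloat(v)),
    strand: col(6),
    phase: col(7),
    attributes: col(8), // raw text here; the library returns a parsed object
  }
}

console.log(
  sketchParseFeatureLine('ctg123\t.\tgene\t1000\t9000\t.\t+\t.\tID=gene00001;Name=EDEN\n'),
)
```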

parseDirective

Parse a GFF3 directive line.

Parameters

  • line string GFF3 directive line

Returns
(GFF3Directive | GFF3SequenceRegionDirective | GFF3GenomeBuildDirective | null) The parsed directive

formatAttributes

Format an attributes object into a string suitable for the 9th column of GFF3.

Parameters


Returns
string GFF3 9th column string
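
Formatting is the reverse of parsing: each tag's values are joined with commas and the tag=value pairs are joined with semicolons. The following is a sketch under those assumptions; `sketchFormatAttributes` is a hypothetical name, and the library may order or escape attributes differently.

```javascript
// Hypothetical sketch: turn { tag: [values] } into a column-9 string.
function sketchFormatAttributes(attrs) {
  if (!attrs) return '.'
  const esc = (v) =>
    String(v).replace(
      /[\x00-\x1f\x7f%;=&,]/g,
      (ch) => `%${ch.charCodeAt(0).toString(16).toUpperCase().padStart(2, '0')}`,
    )
  const parts = []
  for (const [tag, values] of Object.entries(attrs)) {
    if (!values || values.length === 0) continue // skip undefined or empty
    parts.push(`${esc(tag)}=${values.map(esc).join(',')}`)
  }
  // an empty attributes column is written as '.'
  return parts.length > 0 ? parts.join(';') : '.'
}

console.log(sketchFormatAttributes({ ID: ['gene00001'], Name: ['EDEN'] }))
// ID=gene00001;Name=EDEN
```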

formatFeature

Format a feature object or array of feature objects into one or more lines of GFF3.

Parameters


Returns
string A string of one or more GFF3 lines
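
As a sketch of the output shape, the example below writes null values as . and joins the nine columns with tabs, emitting one line per location. `sketchFormatFeatureLine` and `sketchFormatFeature` are hypothetical names; escaping and the recursive formatting of child features are omitted here.

```javascript
// Hypothetical sketch; escaping and child features are omitted for brevity.
function sketchFormatFeatureLine(f) {
  const dot = (v) => (v === null || v === undefined ? '.' : String(v))
  const attrs = Object.entries(f.attributes || {})
    .filter(([, values]) => values && values.length > 0)
    .map(([tag, values]) => `${tag}=${values.join(',')}`)
    .join(';')
  const columns = [f.seq_id, f.source, f.type, f.start, f.end, f.score, f.strand, f.phase]
  return `${columns.map(dot).join('\t')}\t${attrs || '.'}\n`
}

// a feature may be a single line object or an array of them
function sketchFormatFeature(feature) {
  return [].concat(feature).map(sketchFormatFeatureLine).join('')
}

process.stdout.write(
  sketchFormatFeature([
    {
      seq_id: 'ctg123', source: null, type: 'gene', start: 1000, end: 9000,
      score: null, strand: '+', phase: null,
      attributes: { ID: ['gene00001'], Name: ['EDEN'] },
    },
  ]),
)
```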

formatDirective

Format a directive into a line of GFF3.

Parameters


Returns
string A directive line string

formatComment

Format a comment into a GFF3 comment. Yes I know this is just adding a # and a newline.

Parameters


Returns
string A comment line string

formatSequence

Format a sequence object as FASTA

Parameters


Returns
string Formatted single FASTA sequence string
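
A FASTA record is a header line starting with > followed by the sequence text. A minimal sketch, assuming the sequence is written on a single line; `sketchFormatSequence` is a hypothetical name, and the library may wrap or terminate lines differently.

```javascript
// Hypothetical sketch: write one { id, description, sequence } object as FASTA.
function sketchFormatSequence(seq) {
  const header = seq.description ? `>${seq.id} ${seq.description}` : `>${seq.id}`
  return `${header}\n${seq.sequence}\n`
}

process.stdout.write(
  sketchFormatSequence({ id: 'ctgA', description: 'test contig', sequence: 'ACTGACTAGC' }),
)
```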

formatItem

Format a directive, comment, sequence, or feature, or array of such items, into one or more lines of GFF3.

Parameters


Returns
(string | Array<string>) A formatted string or array of strings

GFF3Attributes

A record of GFF3 attribute identifiers and the values of those identifiers
Type: Record<string, (Array<string> | undefined)>

GFF3FeatureLine

A representation of a single line of a GFF3 file

seq_id

The ID of the landmark used to establish the coordinate system for the current feature
Type: (string | null)

source

A free text qualifier intended to describe the algorithm or operating procedure that generated this feature
Type: (string | null)

type

The type of the feature
Type: (string | null)

start

The start coordinates of the feature
Type: (number | null)

end

The end coordinates of the feature
Type: (number | null)

score

The score of the feature
Type: (number | null)

strand

The strand of the feature
Type: (string | null)

phase

For features of type "CDS", the phase indicates where the next codon begins relative to the 5' end of the current CDS feature
Type: (string | null)

attributes

Feature attributes
Type: (GFF3Attributes | null)

GFF3FeatureLineWithRefs

Extends GFF3FeatureLine
A GFF3 Feature line that includes references to other features defined in their "Parent" or "Derives_from" attributes

child_features

An array of child features
Type: Array<GFF3Feature>

derived_features

An array of features derived from this feature
Type: Array<GFF3Feature>

GFF3Feature

A GFF3 feature, which may include multiple individual feature lines
Type: Array<GFF3FeatureLineWithRefs>
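
Since each feature line carries child_features and derived_features arrays, a parsed hierarchy can be walked recursively. The sketch below assumes only the types described above; `walkFeatures` is a hypothetical helper, not part of the library. It tracks visited features because a feature with multiple parents can be reachable along more than one path.

```javascript
// Hypothetical helper: visit every feature in a hierarchy exactly once.
function walkFeatures(features, visit, seen = new Set()) {
  for (const feature of features) {
    if (seen.has(feature)) continue
    seen.add(feature)
    visit(feature)
    // each feature is an array of lines; follow the references on every line
    for (const line of feature) {
      walkFeatures(line.child_features || [], visit, seen)
      walkFeatures(line.derived_features || [], visit, seen)
    }
  }
}

// Hand-built example data in the documented shape, trimmed to a few fields
const cdsFeature = [
  { type: 'CDS', start: 1201, end: 1500, child_features: [], derived_features: [] },
  { type: 'CDS', start: 3000, end: 3902, child_features: [], derived_features: [] },
]
const mrnaFeature = [{ type: 'mRNA', child_features: [cdsFeature], derived_features: [] }]
const geneFeature = [{ type: 'gene', child_features: [mrnaFeature], derived_features: [] }]

const visitedTypes = []
walkFeatures([geneFeature], (feature) => visitedTypes.push(feature[0].type))
console.log(visitedTypes) // [ 'gene', 'mRNA', 'CDS' ]
```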

GFF3Directive

A GFF3 directive

directive

The name of the directive
Type: string

value

The string value of the directive
Type: string

GFF3SequenceRegionDirective

Extends GFF3Directive
A GFF3 sequence-region directive

value

The string value of the directive
Type: string

seq_id

The sequence ID parsed from the directive
Type: string

start

The sequence start parsed from the directive
Type: string

end

The sequence end parsed from the directive
Type: string

GFF3GenomeBuildDirective

Extends GFF3Directive
A GFF3 genome-build directive

value

The string value of the directive
Type: string

source

The genome build source parsed from the directive
Type: string

buildName

The genome build name parsed from the directive
Type: string

GFF3Comment

A GFF3 comment

comment

The text of the comment
Type: string

GFF3Sequence

A GFF3 FASTA single sequence

id

The ID of the sequence
Type: string

description

The description of the sequence
Type: string

sequence

The sequence
Type: string

License

MIT © Robert Buels