HiSeq 3000 File Outputs
-
Each flow cell (run) contains eight lanes, and each lane may be indexed ("barcoded") with a different type of kit, e.g., single 6 bp, dual 8 bp, or other.
-
Lanes from several customers using an identical type of indexing kit are demultiplexed into FASTQ files together in the same output directory, e.g.,
-
/nfs2/hts/illumina/170726_J00107_0144_AHK2KVBBXX_1412/L12367
-
Internal (OSU) customers are responsible for copying out their FASTQ results per the HTS Run Policy
-
External customers will have their FASTQ results staged on a website for download
-
Illumina and FastQC reports are generated for all FASTQ files and published to a website for review
-
Questions regarding the sequencing data output can be addressed by Matthew Peterson
HiSeq 3000 Features
There are several significant differences between the Illumina HiSeq 3000 and the previous HiSeq 2000 platform:
-
The HiSeq 3000 uses a new “Patterned Flow Cell Technology”
-
Illumina’s specification lists a minimum of approximately 262.5 million clusters passing filter (PF) per flow cell lane
-
Comparing the HiSeq 3000 to the 2000 in production, we have observed in excess of 300 million paired reads per lane (an increase from approximately 175 million with the 2000)
-
Supported read lengths have increased, i.e., 2x150 bp on the HiSeq 3000 vs. 2x100 bp on the HiSeq 2000
-
Sequencing time has decreased, e.g., a 2x100 bp run in 2.5 days on the HiSeq 3000 vs. 2 weeks on the HiSeq 2000
-
Sample Preparation
-
Not all Illumina sample preparation libraries are officially supported yet; currently supported libraries include:
-
DNA: TruSeq Nano DNA (350 bp insert only), TruSeq PCR-Free DNA (350 bp insert only)
-
RNA: TruSeq RNA v2, TruSeq mRNA Stranded, TruSeq Total RNA Stranded, TruSeq RNA Access
-
Exome: Nextera Rapid Capture Exome
-
Contact Mark Dasenko regarding the submission of libraries that are not officially supported.
-
We have successfully run the following unsupported library types in our initial beta tests:
-
Nextera XT
-
Wafergen DNA (>350 bp insert)
-
Wafergen Stranded RNA
-
Nextera Mate Pair
-
GBS libraries
-
The HiSeq 3000 preferentially sequences shorter fragments, so cleanup of adapter dimers in your libraries is critical.
-
e.g., 5% adapter dimer content in a library can result in 60% of the reads being adapter sequence.
-
It is highly recommended to check for adapter dimers using the Bioanalyzer HS-DNA chip.
-
Insert sizes of up to 350 bp are supported on the HiSeq 3000 vs. >550 bp on the HiSeq 2000.
-
Longer insert sizes may result in polyclonal clusters, i.e., clusters which span two wells, which will not pass filter (PF) and will not appear in the FASTQ output.
-
Loading concentrations
-
Optimal loading concentration will result in many unique reads, although the Q-scores may be slightly lower.
-
Lower, non-optimal concentrations will yield approximately the same number of reads with slightly higher Q-scores, but they will also contain numerous duplicate reads.
-
Too high a concentration leads to polyclonal clusters, which will not pass filter (PF) and will not appear in the FASTQ output.
-
With patterned flow cell technology, it is significantly harder to overcluster a run than with the HiSeq 2000, but overclustering can still occur.
-
FASTQ output
-
“Passing Filter” (PF%) interpretation
-
The new patterned flow cells have a finite number of ordered wells that hold clusters (approximately 482.68 million).
-
The PF% represents the percentage of wells, out of the theoretical maximum, that contain a cluster passing Illumina’s chastity filter, e.g.,
-
A 60% PF rate implies a yield of approximately 289 million sequences (0.6 * 482 million; exceeding Illumina’s specifications).
-
Types of wells and clusters that do not pass filter that lead to a lower PF%:
-
Empty wells
-
Dim clusters
-
Low-quality clusters
-
Polyclonal clusters
-
The FASTQ files you receive will only contain clusters that are passing filter (PF).
-
Empty flow cell wells are one example of what does not pass filter (PF); if they were included in the FASTQ files, they would appear as tens of millions of sequences composed entirely of Ns.
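The PF% arithmetic described above can be sketched in a few lines of Python. This is only an illustration and not part of the Illumina pipeline: the function names are hypothetical, and the constants are the approximate figures quoted on this page (the well count is treated as a per-lane value, as the comparison with the specification implies).

```python
# Approximate figures quoted on this page (assumed to be per flow cell lane).
WELLS_PER_LANE = 482.68e6    # ordered wells available to hold clusters
SPEC_PF_CLUSTERS = 262.5e6   # Illumina's stated minimum of PF clusters


def pf_yield(pf_fraction, wells=WELLS_PER_LANE):
    """Estimate the number of PF clusters from a PF rate (0.60 means 60%)."""
    if not 0.0 <= pf_fraction <= 1.0:
        raise ValueError("pf_fraction must be between 0 and 1")
    return pf_fraction * wells


def pf_fraction_for(target_clusters, wells=WELLS_PER_LANE):
    """Return the PF rate needed to reach a target number of PF clusters."""
    return target_clusters / wells


if __name__ == "__main__":
    print(f"60% PF: {pf_yield(0.60) / 1e6:.1f} million clusters")
    print(f"PF rate needed to meet spec: {pf_fraction_for(SPEC_PF_CLUSTERS):.1%}")
```

Using the full well count rather than the rounded 482 million, a 60% PF rate works out to roughly 289.6 million clusters, and a PF rate of about 54% would be enough to meet the specification.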
-
The Illumina pipeline creates a single fastq.gz file containing all of the sequences per indexed sample, per read.
-
By default, the HiSeq 2000 pipeline splits the same data into several files, each containing at most 4 million sequences.
-
The Illumina pipeline creates fastq.gz files using the Blocked GNU Zip Format (BGZF)
-
The metadata in BGZF compressed FASTQ files may allow for greater random access by some bioinformatics programs.
-
The majority of bioinformatics programs and Linux utilities can read BGZF compressed files in the same way as “standard” gzip files.
-
A workaround for programs that cannot use the BGZF compressed FASTQ files is to recompress them without using BGZF, e.g.,
-
zcat original.fastq.gz | gzip - > new.fastq.gz
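For pipelines that are already written in Python, the same check-and-recompress step can be sketched with the standard library. This is a minimal sketch, not a facility-provided tool: the function names are hypothetical. It relies on the fact that a BGZF block is a gzip member whose header carries an extra subfield with the identifier bytes "BC", and on Python's gzip module reading multi-member gzip files transparently.

```python
import gzip
import shutil
import struct


def is_bgzf(path):
    """Return True if the first gzip member has the BGZF 'BC' extra subfield."""
    with open(path, "rb") as handle:
        header = handle.read(12)
        if len(header) < 12:
            return False
        is_gzip = header[0] == 0x1F and header[1] == 0x8B and header[2] == 8
        has_extra = bool(header[3] & 0x04)  # FEXTRA flag
        if not (is_gzip and has_extra):
            return False
        (xlen,) = struct.unpack("<H", header[10:12])
        extra = handle.read(xlen)
    # Walk the extra subfields: two ID bytes, a 2-byte length, then the payload.
    pos = 0
    while pos + 4 <= len(extra):
        (slen,) = struct.unpack("<H", extra[pos + 2:pos + 4])
        if extra[pos:pos + 2] == b"BC" and slen == 2:
            return True
        pos += 4 + slen
    return False


def recompress_plain_gzip(src, dst):
    """Rewrite a (BGZF or plain) gzip file as a single-stream gzip file."""
    with gzip.open(src, "rb") as fin, gzip.open(dst, "wb") as fout:
        shutil.copyfileobj(fin, fout)
```

A typical use would be to call `is_bgzf("original.fastq.gz")` first and only run `recompress_plain_gzip("original.fastq.gz", "new.fastq.gz")` for tools that fail on the BGZF file.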
-
Q-score binning
-
To save space (via data compression), Q-scores for the HiSeq 3000 are binned into one of 8 values instead of (at least) 40 values in HiSeq 2000 FASTQ files.
-
The range of the bins is subject to change with HiSeq 3000 software upgrades.
-
Current HiSeq 3000 bins (version 2.10.1)
-
10-19 -> 12
-
20-24 -> 22
-
25-29 -> 27
-
30-34 -> 32
-
35-39 -> 37
-
40-43 -> 41
-
Third-party FASTQ sequence quality trimming programs may be affected by this new binning system.
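The bin table above, and its effect on a trimming threshold, can be sketched in Python. This is an illustrative sketch only: the function names are hypothetical, only the bins listed on this page are included (raw scores outside them are left unmapped), and the Phred+33 quality encoding is an assumption.

```python
# Bins as listed on this page for HiSeq 3000 software version 2.10.1:
# (lowest raw score, highest raw score, binned score)
QSCORE_BINS = [
    (10, 19, 12),
    (20, 24, 22),
    (25, 29, 27),
    (30, 34, 32),
    (35, 39, 37),
    (40, 43, 41),
]

PHRED_OFFSET = 33  # assumption: FASTQ qualities use Phred+33 encoding


def bin_qscore(raw):
    """Return the binned score, or None if raw is outside the listed bins."""
    for low, high, binned in QSCORE_BINS:
        if low <= raw <= high:
            return binned
    return None


def to_phred33(score):
    """Return the FASTQ quality character for a score (Phred+33 assumed)."""
    return chr(score + PHRED_OFFSET)


def effective_threshold(threshold):
    """Return the lowest binned score that still passes a trimming threshold.

    Returns None if no listed bin reaches the threshold.
    """
    passing = [binned for _, _, binned in QSCORE_BINS if binned >= threshold]
    return min(passing) if passing else None
```

Because only the binned values occur in the data, any trimming threshold from 28 to 32 keeps exactly the same bases (`effective_threshold` returns 32 for all of them), which is one way a tool tuned on unbinned HiSeq 2000 data could behave differently.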
-
Index reads (“barcodes”) on the HiSeq 3000 are demultiplexed using the default settings, which allow a 1-base mismatch.
-
The HiSeq 2000 demultiplexing step allows 0 mismatches by default, i.e., indices must match perfectly to be “binned”.
-
Sample libraries that do not contain sufficient diversity in their index reads require demultiplexing using 0 mismatches, e.g.,
-
Illumina TruSeq small RNA libraries (not currently supported) and Wafergen RNA-seq libraries
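The reason low-diversity index sets need 0 mismatches can be illustrated with a small Python sketch. It is not the Illumina demultiplexer's implementation; the function names and the example index sequences are hypothetical, and a read that matches more than one sample within the tolerance is simply treated as unassigned.

```python
from itertools import combinations


def hamming(a, b):
    """Count the positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(x != y for x, y in zip(a, b))


def assign_sample(index_read, sample_indexes, max_mismatches=1):
    """Return the single sample whose index is within max_mismatches.

    Returns None when no sample, or more than one sample, matches.
    """
    hits = [
        name
        for name, index in sample_indexes.items()
        if hamming(index_read, index) <= max_mismatches
    ]
    return hits[0] if len(hits) == 1 else None


def min_index_distance(sample_indexes):
    """Return the smallest pairwise Hamming distance within an index set."""
    pairs = combinations(sample_indexes.values(), 2)
    return min((hamming(a, b) for a, b in pairs), default=None)


if __name__ == "__main__":
    diverse = {"sample_A": "ATCACG", "sample_B": "CGATGT"}
    similar = {"sample_A": "ATCACG", "sample_B": "ATCACT"}
    print(assign_sample("ATCACC", diverse, max_mismatches=1))  # one error tolerated
    print(assign_sample("ATCACG", similar, max_mismatches=1))  # ambiguous
    print(assign_sample("ATCACG", similar, max_mismatches=0))  # exact match only
```

In this sketch, an index set whose minimum pairwise distance is below 3 cannot be separated unambiguously when one mismatch is tolerated, so requiring exact matches is the safe choice for such sets.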