Checksum

FastQC

Modern high throughput sequencers can generate hundreds of millions of sequences in a single run. Before analysing this sequence to draw biological conclusions you should always perform some simple quality control checks to ensure that the raw data looks good and there are no problems or biases in your data which may affect how you can usefully use it.

Most sequencers will generate a QC report as part of their analysis pipeline, but this is usually only focused on identifying problems which were generated by the sequencer itself. FastQC aims to provide a QC report which can spot problems which originate either in the sequencer or in the starting library material.

FastQC can be run in one of two modes. It can either run as a stand alone interactive application for the immediate analysis of small numbers of FastQ files, or it can be run in a non-interactive mode where it would be suitable for integrating into a larger analysis pipeline for the systematic processing of large numbers of files.

Basic Statistics

Summary

The Basic Statistics module generates some simple composition statistics for the file analysed.

  • File Name: The original name of the file that was analysed.
  • File type: Whether the file appeared to contain actual base calls or colorspace data that had to be converted to base calls.
  • Phred Encoding: The ASCII encoding of quality values that was found in this file.
  • Total Reads: A count of the total number of sequences processed.
  • Total Bases: A count of the total number of bases in all sequences processed.
  • Total T Bases: A count of the total number of 'T' bases in all sequences processed.
  • Total C Bases: A count of the total number of 'C' bases in all sequences processed.
  • Total G Bases: A count of the total number of 'G' bases in all sequences processed.
  • Total A Bases: A count of the total number of 'A' bases in all sequences processed.
  • Total N Bases: A count of the total number of 'N' bases in all sequences processed, where 'N' marks a base that the sequencer could not call.
  • %GC: The overall GC content of all bases in all sequences. In the example below it is reported as a fraction between 0 and 1, with all bases (including 'N') in the denominator.
  • Min Length: The length of the shortest sequence in the set.
  • Max Length: The length of the longest sequence in the set.
  • Lowest Char: The lowest quality character in all sequences processed, given as its ASCII code.
  • Highest Char: The highest quality character in all sequences processed, given as its ASCII code.

Example

"basic_stats":{
            "file_name":"test.fastq.gz",
            "file_type":"",
            "phred":{
                "name":"Sanger / Illumina 1.9",
                "offset":33
            },
            "total_reads":250000,
            "total_bases":37500000,
            "t_count":8473918,
            "c_count":10050719,
            "g_count":9888266,
            "a_count":9086549,
            "n_count":548,
            "gc_percentage":0.5317062666666666,
            "min_length":150,
            "max_length":150,
            "lowest_char":35,
            "highest_char":70
        },
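
The fields above can be reproduced with a few counters. The following is a minimal sketch, not the tool's actual implementation: the function name `basic_stats` and the `(sequence, quality)` input shape are assumptions made for illustration. The GC value is computed as a fraction of all bases, which is consistent with the example above, where (10050719 + 9888266) / 37500000 = 0.5317.

```python
from collections import Counter

def basic_stats(records):
    """Compute composition statistics from (sequence, quality_string) pairs.

    Hypothetical helper: the name and the input shape are assumptions.
    """
    counts = Counter()
    total_reads = 0
    min_len = max_len = None
    lowest = highest = None
    for seq, qual in records:
        total_reads += 1
        counts.update(seq.upper())
        n = len(seq)
        min_len = n if min_len is None else min(min_len, n)
        max_len = n if max_len is None else max(max_len, n)
        if qual:
            codes = [ord(c) for c in qual]
            lowest = min(codes) if lowest is None else min(lowest, min(codes))
            highest = max(codes) if highest is None else max(highest, max(codes))
    total_bases = sum(counts.values())
    gc = counts["G"] + counts["C"]
    return {
        "total_reads": total_reads,
        "total_bases": total_bases,
        "t_count": counts["T"],
        "c_count": counts["C"],
        "g_count": counts["G"],
        "a_count": counts["A"],
        "n_count": counts["N"],
        # Fraction of all bases (N included in the denominator).
        "gc_percentage": gc / total_bases if total_bases else 0.0,
        "min_length": min_len,
        "max_length": max_len,
        "lowest_char": lowest,    # ASCII code of the lowest quality character
        "highest_char": highest,  # ASCII code of the highest quality character
    }

stats = basic_stats([("ACGTN", "#5?IF"), ("GGCC", "IIII")])
```

With an offset of 33, a `lowest_char` of 35 corresponds to a Phred score of 2.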

Per Base Sequence Quality

Summary

The Per Base Sequence Quality module shows an overview of the range of quality values across all bases at each position in the FastQ file. Positions are grouped into bins, each holding either a single position or a range of positions.

  • X-labels: The base positions, grouped into bins that each hold either a single position or a range of positions.
  • Mean: For each position or range of positions in x-labels, the average quality score across all sequences.
  • Median: For each position or range of positions in x-labels, the median quality score across all sequences.
  • Lower Quartile: For each position or range of positions in x-labels, the lower quartile (25%) of the quality scores across all sequences.
  • Upper Quartile: For each position or range of positions in x-labels, the upper quartile (75%) of the quality scores across all sequences.
  • Lowest: For each position or range of positions in x-labels, the lowest quality score across all sequences.
  • Highest: For each position or range of positions in x-labels, the highest quality score across all sequences.

Example

"per_base_seq_quality": {
            "xlabels": ["1","2","3","4","5","6","7","8","9","10-14","15-19","20-24","25-29","30-34","35-39",...],
            "mean": [36.307356,36.290732,36.46844,36.51792,36.521944,36.526816,36.473336,
36.543864,36.515792,36.563705600000006,36.560803199999995,36.5081312,36.4498688,36.42268,36.3846096, ...],
            "median": [37,37,37,37,37,37,37,37,37,37,37,37,37,37,37,...],
            "lower_quartile": [37,37,37,37,37,37,37,37,37,37,37,37,37,37,37,... ],
            "upper_quartile": [37,37,37,37,37,37,37,37,37,37,37,37,37,37,37,... ],
            "lowest": [37,37,37,37,37,37,37,37,37,37,37,37,37,37,37,... ],
            "highest": [37,37,37,37,37,37,37,37,37,37,37,37,37,37,37,... ],
        },
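
The grouping and the per-group statistics can be sketched as follows. This is an illustration under stated assumptions rather than the module's real code: the bin layout (nine single positions, then ranges of five) is inferred from the x-labels in the example, scores within a range are simply pooled, and the names `position_groups`, `label` and `per_base_quality` are hypothetical.

```python
import statistics

def position_groups(max_len, singles=9, width=5):
    """Return (start, end) pairs, 1-based and inclusive. Layout is an assumption."""
    groups = [(p, p) for p in range(1, min(singles, max_len) + 1)]
    start = singles + 1
    while start <= max_len:
        end = min(start + width - 1, max_len)
        groups.append((start, end))
        start = end + 1
    return groups

def label(group):
    start, end = group
    return str(start) if start == end else f"{start}-{end}"

def per_base_quality(quals, offset=33):
    """Summarise Phred scores per position group from quality strings."""
    max_len = max(len(q) for q in quals)
    out = {k: [] for k in ("xlabels", "mean", "median", "lower_quartile",
                           "upper_quartile", "lowest", "highest")}
    for start, end in position_groups(max_len):
        # Pool every score that falls inside this group of positions.
        scores = [ord(q[i]) - offset
                  for q in quals
                  for i in range(start - 1, min(end, len(q)))]
        if len(scores) >= 2:
            q1, _, q3 = statistics.quantiles(scores, n=4, method="inclusive")
        else:
            q1 = q3 = scores[0]
        out["xlabels"].append(label((start, end)))
        out["mean"].append(sum(scores) / len(scores))
        out["median"].append(statistics.median(scores))
        out["lower_quartile"].append(q1)
        out["upper_quartile"].append(q3)
        out["lowest"].append(min(scores))
        out["highest"].append(max(scores))
    return out

result = per_base_quality(["II", "5I"])
```

Quartile values depend on the interpolation method chosen, so they may differ slightly from those reported by other tools.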

Per Sequence Quality Score

Summary

The Per Sequence Quality Score report allows you to see if a subset of your sequences have universally low quality values. It is often the case that a subset of sequences will have universally poor quality, often because they are poorly imaged (on the edge of the field of view etc), however these should represent only a small percentage of the total sequences.

If a significant proportion of the sequences in a run have overall low quality then this could indicate some kind of systematic problem - possibly with just part of the run (for example one end of a flowcell).

Results from this module will not be displayed if your input is a BAM/SAM file in which quality scores have not been recorded.

  • X-category quality: The set of distinct quality scores found across all sequences processed, sorted in ascending order.
  • Y-category count: For each quality score in x-category quality, the number of sequences with that average quality score.
  • Most frequent score: The quality score that occurs most often.

Example

"per_seq_quality_score": {
            "x_category_quality": [16,17,18,19,20,21,22,23,24,25,...],
            "y_category_count": [182111,5,4,9,37,61,126,342,479,485,...],
            "most_frequent_score": 16
        },
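
A minimal sketch of how such a histogram could be built is shown below. The function name `per_seq_quality` is hypothetical, and truncating each read's mean score to an integer is an assumption, not something stated in this document.

```python
from collections import Counter

def per_seq_quality(quals, offset=33):
    """Histogram of per-read mean Phred scores from quality strings."""
    counts = Counter()
    for q in quals:
        if not q:
            continue  # a read without quality values contributes nothing
        mean = sum(ord(c) - offset for c in q) / len(q)
        counts[int(mean)] += 1  # assumption: truncate the mean to an integer
    xs = sorted(counts)
    ys = [counts[x] for x in xs]
    # Ties are resolved in favour of the lowest score.
    most_frequent = max(xs, key=lambda x: counts[x]) if xs else None
    return {
        "x_category_quality": xs,
        "y_category_count": ys,
        "most_frequent_score": most_frequent,
    }

hist = per_seq_quality(["IIII", "5555", "II55", "IIII"])
```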

Per Base Sequence Content

Summary

The Per Base Sequence Content module shows, for each base position in a file, the proportion of calls made for each of the four normal DNA bases.

In a random library you would expect that there would be little to no difference between the different bases of a sequence run, so the lines in this plot should run parallel with each other. The relative amount of each base should reflect the overall amount of these bases in your genome, but in any case they should not be hugely imbalanced from each other.

It's worth noting that some types of library will always produce biased sequence composition, normally at the start of the read. Libraries produced by priming using random hexamers (including nearly all RNA-Seq libraries) and those which were fragmented using transposases inherit an intrinsic bias in the positions at which reads start. This bias does not concern an absolute sequence, but instead provides enrichment of a number of different K-mers at the 5' end of the reads. Whilst this is a true technical bias, it isn't something which can be corrected by trimming and in most cases doesn't seem to adversely affect the downstream analysis. It will however produce a warning or error in this module.

  • X-category: The base positions, grouped into bins that each hold either a single position or a range of positions.
  • G counts: For each entry in x-category, the number of times a 'G' base has been called.
  • C counts: For each entry in x-category, the number of times a 'C' base has been called.
  • A counts: For each entry in x-category, the number of times an 'A' base has been called.
  • T counts: For each entry in x-category, the number of times a 'T' base has been called.
  • Percentages: For each entry in x-category, the percentage of calls made for each of the four normal DNA bases. The arrays are stored in the order [t_percent_array, c_percent_array, a_percent_array, g_percent_array].

Example

"per_base_seq_content": {
            "x_category": ["1","2","3","4","5","6","7","8","9","10-14","15-19","20-24","25-29",...],
            "g_counts": [
            	115472,
                80383,
                76362,
                83320,
                81674,
                80780,
                61621,
                59846,
                61399,
                60254,
                67033,
                64991,
                65134,
                ...
            ],
            "c_counts": [
            	63849,
                69603,
                80458,
                75224,
                60497,
                51506,
                49210,
                57837,
                55266,
                54779,
                59960,
                60946,
                61423,
                ...
            ],
            "a_counts": [
            	38933,
                29960,
                35063,
                41477,
                50424,
                59984,
                60756,
                62327,
                58272,
                72997,
                65287,
                64525,
                64301,
                ...
            ],
            "t_counts": [
            	31663,
                69593,
                58117,
                49979,
                57405,
                57728,
                78413,
                69990,
                75063,
                61970,
                57718,
                59538,
                59142,
                ...
            ],
            "percentages": [
                [
                    12.669406242872633,
                    27.88862662750111,
                    23.2468,
                    19.991600000000002,
                    22.962,
                    23.09138473107785,
                    31.365199999999998,
                    27.996,
                    30.0252,
                    23.834518135229015,
                    22.720000000000002,
                    23.19664,
                    22.7348,
                    ...
                ],
                [
                    25.54808196321179,
                    27.892634017127584,
                    32.1832,
                    30.0896,
                    24.198800000000002,
                    20.602564820518566,
                    19.683999999999997,
                    23.1348,
                    22.1064,
                    23.88315821305314,
                    25.505119999999998,
                    25.90864,
                    27.118,
                    ...
                ],
                [
                    15.578372019510478,
                    12.006139320907753,
                    14.025199999999998,
                    16.5908,
                    20.1696,
                    23.993791950335602,
                    24.3024,
                    24.9308,
                    23.308799999999998,
                    26.43276229241967,
                    24.285680000000003,
                    24.44032,
                    25.01016,
                    ...
                ],
                [
                    46.2041397744051,
                    32.21260003446355,
                    30.5448,
                    33.328,
                    32.669599999999996,
                    32.31225849806798,
                    24.648400000000002,
                    23.9384,
                    24.5596,
                    25.849561359298175,
                    27.489200000000004,
                    26.4544,
                    25.13704,
                    ...
                ]
            ]
  },
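
The relationship between the counts and the percentages can be sketched as below. The function name `base_percentages` is hypothetical. Dividing by the sum of the four called bases (so excluding 'N') reproduces the single-position entries in the example above; how entries covering a range of positions are aggregated is not specified here, and this sketch does not attempt to reproduce them.

```python
def base_percentages(g_counts, c_counts, a_counts, t_counts):
    """Return [t_percent, c_percent, a_percent, g_percent] arrays."""
    t_pct, c_pct, a_pct, g_pct = [], [], [], []
    for g, c, a, t in zip(g_counts, c_counts, a_counts, t_counts):
        called = g + c + a + t  # 'N' calls are not part of the denominator
        scale = 100.0 / called if called else 0.0
        t_pct.append(t * scale)
        c_pct.append(c * scale)
        a_pct.append(a * scale)
        g_pct.append(g * scale)
    return [t_pct, c_pct, a_pct, g_pct]

# Counts for positions 1 to 3, taken from the example above.
pct = base_percentages(
    g_counts=[115472, 80383, 76362],
    c_counts=[63849, 69603, 80458],
    a_counts=[38933, 29960, 35063],
    t_counts=[31663, 69593, 58117],
)
```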

Per Sequence GC Content

Summary

This module measures the GC content across the whole length of each sequence in a file and compares it to a modelled normal distribution of GC content.

In a normal random library you would expect to see a roughly normal distribution of GC content where the central peak corresponds to the overall GC content of the underlying genome. Since we don't know the GC content of the genome, the modal GC content is calculated from the observed data and used to build a reference distribution.

An unusually shaped distribution could indicate a contaminated library or some other kinds of biased subset. A normal distribution which is shifted indicates some systematic bias which is independent of base position. If there is a systematic bias which creates a shifted normal distribution then this won't be flagged as an error by the module since it doesn't know what your genome's GC content should be.

  • X-category: The set of mean %GC values.
  • Y-gc distribution: For each %GC in x-category, the number of sequences observed with that GC content.
  • Y-theoretic distribution: For each %GC in x-category, the number of sequences predicted by the reference distribution.
  • Deviation percent: The deviation between y-gc distribution and y-theoretic distribution, summed over all positions and expressed as a percentage.

Example

"per_seq_gc_content": {
            "x_category": [
                0,
                1,
                2,
                3,
                4,
                5,
                6,
                7,
                8,
                9,
                10,
                11,
                12,
                13,
                ...
            ],
            "y_gc_distribution": [
                0,
                0,
                0.5,
                0.5,
                0,
                0,
                0,
                1,
                1,
                3.5,
                3.5,
                0,
                3,
                8,
               ...
            ],
            "y_theo_distribution": [
                11.301763218607428,
                13.897288489846439,
                17.029620470668664,
                20.79557306799733,
                25.306252897428912,
                30.68851136823189,
                37.08641280899335,
                44.66268842218363,
                53.60013975983366,
                64.10294913058424,
                76.39784805677265,
                90.73508881215378,
                107.3891584351843,
               ...
            ],
            "deviation_percent": 42.145681622216166
        },
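
One way to build a reference distribution from the modal GC content is sketched below. This is a simplified illustration, not the module's actual algorithm: it takes the single highest bin as the mode, measures the spread around that mode, and normalises the summed absolute deviation by the total number of sequences. The function name `gc_model` and each of these choices are assumptions.

```python
import math

def gc_model(observed):
    """Return (theoretical_counts, deviation_percent) for a GC histogram."""
    total = sum(observed)
    if total == 0:
        return [0.0] * len(observed), 0.0
    # Assumption: the mode is the single highest bin.
    mode = max(range(len(observed)), key=lambda i: observed[i])
    variance = sum(c * (i - mode) ** 2 for i, c in enumerate(observed)) / total
    if variance == 0:
        # Every sequence sits in one bin, so the model is that bin alone.
        theoretical = [float(total) if i == mode else 0.0
                       for i in range(len(observed))]
    else:
        sd = math.sqrt(variance)
        norm = sd * math.sqrt(2 * math.pi)
        theoretical = [
            total * math.exp(-((i - mode) ** 2) / (2 * variance)) / norm
            for i in range(len(observed))
        ]
    deviation = sum(abs(t - o) for t, o in zip(theoretical, observed))
    return theoretical, 100.0 * deviation / total

# A bell-shaped histogram centred on 50 %GC should deviate very little.
bell = [round(1000 * math.exp(-((i - 50) ** 2) / 50)) for i in range(101)]
theo, dev = gc_model(bell)
```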

Per Base N Content

Summary

If a sequencer is unable to make a base call with sufficient confidence then it will normally substitute an N rather than a conventional base call.

It's not unusual to see a very low proportion of Ns appearing in a sequence, especially nearer the end of a sequence. However, if this proportion rises above a few percent it suggests that the analysis pipeline was unable to interpret the data well enough to make valid base calls.

  • X-category: The base positions, grouped into bins that each hold either a single position or a range of positions.
  • N counts: For each entry in x-category, the count of 'N' bases.
  • Not N counts: For each entry in x-category, the combined count of 'A', 'T', 'C' and 'G' bases.
  • Percentages: For each entry in x-category, the percentage of calls for which the base 'N' has been made.

Example

"per_base_n_content": {
           "x_categories": [
                "1",
                "2",
                "3",
                "4",
                "5",
                "6",
                "7",
                "8",
                "9",
                "10-14",
                "15-19",
                "20-24",
                ...
            ],
            "n_counts": [
                83,
                461,
                0,
                0,
                0,
                2,
                0,
                0,
                0,
                0,
                2,
                0,
                ...
            ],
            "not_n_counts": [
                249917,
                249539,
                250000,
                250000,
                250000,
                249998,
                250000,
                250000,
                250000,
                250000,
                249998,
                250000,
                ...
            ],
            "percentages": [
                0.0332,
                0.18439999999999998,
                0,
                0,
                0,
                0.0007999999999999999,
                0,
                0,
                0,
                0.00015999999999999999,
                0,
                0,
                ...
            ] 
        },
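
For single positions, the percentage follows directly from the two count arrays. The sketch below uses a hypothetical function name, `n_percentages`, and has been checked only against the single-position entries of the example above (for instance 83 / (83 + 249917) = 0.0332%).

```python
def n_percentages(n_counts, not_n_counts):
    """Percentage of 'N' calls for each entry of the two count arrays."""
    result = []
    for n, called in zip(n_counts, not_n_counts):
        total = n + called
        result.append(100.0 * n / total if total else 0.0)
    return result

# Positions 1 to 3 from the example above.
n_pct = n_percentages([83, 461, 0], [249917, 249539, 250000])
```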
			

Sequence Length Distribution

Summary

Some high throughput sequencers generate sequence fragments of uniform length, but others can contain reads of wildly varying lengths. Even within uniform length libraries some pipelines will trim sequences to remove poor quality base calls from the end.

This module shows the distribution of fragment sizes in the file which was analysed.

In many cases this will produce a simple graph showing a peak only at one size, but for variable length FastQ files this will show the relative amounts of each different size of sequence fragment.

  • X-category: The set of sequence lengths.
  • Graph counts: The number of sequences for each length in x-category.

Example

"seq_len_distribution": {
            "x_categories": [
                "149",
                "150",
                "151"
            ],
            "graph_counts": [
                0,
                250000,
                0
            ]
        },
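
A length histogram of this shape could be produced as follows. The function name `length_distribution` is hypothetical, and padding the observed range with one empty bin on each side is an assumption inferred from the "149" and "151" entries in the example above.

```python
from collections import Counter

def length_distribution(lengths):
    """Count reads per length, padded by one empty bin on either side."""
    counts = Counter(lengths)
    if not counts:
        return {"x_categories": [], "graph_counts": []}
    low = max(min(counts) - 1, 0)
    high = max(counts) + 1
    xs = list(range(low, high + 1))
    return {
        "x_categories": [str(x) for x in xs],
        "graph_counts": [counts.get(x, 0) for x in xs],
    }

dist = length_distribution([150, 150, 150, 150])
```

For files with a wide spread of lengths, one bin per length may be too fine, and grouping lengths into ranges would be a reasonable extension.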

Overrepresented Sequences

Summary

A normal high-throughput library will contain a diverse set of sequences, with no individual sequence making up more than a tiny fraction of the whole. Finding that a single sequence is very overrepresented in the set either means that it is highly biologically significant, or indicates that the library is contaminated, or not as diverse as you expected.

This module lists all of the sequences which make up more than 0.1% of the total. To conserve memory only sequences which appear in the first 100,000 sequences are tracked to the end of the file. It is therefore possible that a sequence which is overrepresented but doesn't appear at the start of the file for some reason could be missed by this module.

For each overrepresented sequence the program will look for matches in a database of common contaminants and will report the best hit it finds. Hits must be at least 20bp in length and have no more than 1 mismatch. Finding a hit doesn't necessarily mean that this is the source of the contamination, but may point you in the right direction. It's also worth pointing out that many adapter sequences are very similar to each other so you may get a hit reported which isn't technically correct, but which has very similar sequence to the actual match.

Because the duplication detection requires an exact sequence match over the whole length of the sequence any reads over 75bp in length are truncated to 50bp for the purposes of this analysis. Even so, longer reads are more likely to contain sequencing errors which will artificially increase the observed diversity and will tend to underrepresent highly duplicated sequences.

  • Count: The total number of sequences in the file.
  • Observation cut off: The number of unique sequences to track in the overrepresented sequences module.
  • Unique sequence count: The number of unique sequences currently being tracked.
  • Count at unique limit: The number of sequences that had been processed when the unique sequence count reached the observation cut off.
  • Overrepresented sequences: The set of overrepresented sequences. Each entry has the attributes "seq", "count", "percentage" and "contaminant_hit", where "seq" is the base sequence, "count" is the number of sequences containing that base sequence, "percentage" is "count" as a percentage of the total count, and "contaminant_hit" is the match found in the contaminant file.

Example

"overrepresented_seqs": {
            "count": 250000,
            "observation_cut_off": 100000,
            "unique_seq_count": 100000,
            "count_at_unique_limit": 140053,
            "overrepresented_seqs": [
                {
                    "seq": "GTGGCTATTCACAGGCGCGATCCCACTACTGATCAGCACGGGAGTTTTGA",
                    "count": 1941,
                    "percentage": 0.7764,
                    "contaminant_hit": null
                },
                {
                    "seq": "GCAGTGGCTATTCACAGGCGCGATCCCACTACTGATCAGCACGGGAGTTT",
                    "count": 1347,
                    "percentage": 0.5388000000000001,
                    "contaminant_hit": null
                },
                {
                    "seq": "AGTGGCTATTCACAGGCGCGATCCCACTACTGATCAGCACGGGAGTTTTG",
                    "count": 1291,
                    "percentage": 0.5164,
                    "contaminant_hit": null
                },
                {
                    "seq": "GTGCAGTGGCTATTCACAGGCGCGATCCCACTACTGATCAGCACGGGAGT",
                    "count": 916,
                    "percentage": 0.3664,
                    "contaminant_hit": null
                },
                ...
            ]
        },
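
The tracking scheme described above can be sketched as follows. This is an illustration under stated assumptions, not the module's code: the function name `find_overrepresented` is hypothetical, the contaminant lookup is left out (every `contaminant_hit` is `None`), and `count_at_unique_limit` is taken to be the total read count when the cut off is never reached.

```python
def find_overrepresented(seqs, cutoff=100_000, threshold_pct=0.1):
    """Track the first `cutoff` unique sequences and report frequent ones."""
    counts = {}
    total = 0
    count_at_unique_limit = 0
    frozen = False  # True once `cutoff` unique sequences are being tracked
    for seq in seqs:
        total += 1
        # Reads over 75bp are truncated to 50bp before comparison.
        key = seq[:50] if len(seq) > 75 else seq
        if key in counts:
            counts[key] += 1
        elif not frozen:
            counts[key] = 1
            if len(counts) == cutoff:
                frozen = True
                count_at_unique_limit = total
    if not frozen:
        count_at_unique_limit = total
    hits = [
        {"seq": s, "count": n, "percentage": 100.0 * n / total,
         "contaminant_hit": None}  # contaminant search not sketched here
        for s, n in counts.items()
        if 100.0 * n / total > threshold_pct
    ]
    hits.sort(key=lambda h: h["count"], reverse=True)
    return {
        "count": total,
        "observation_cut_off": cutoff,
        "unique_seq_count": len(counts),
        "count_at_unique_limit": count_at_unique_limit,
        "overrepresented_seqs": hits,
    }

reads = ["A" * 10] * 3 + ["C" * 10] + ["G" * 10] * 2
report = find_overrepresented(reads, cutoff=2)
```

In this small run the "G" reads first appear after the cut off of two unique sequences has been reached, so they are never tracked, which illustrates how a late-appearing sequence can be missed.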

Duplicate Sequence

Summary

In a diverse library most sequences will occur only once in the final set. A low level of duplication may indicate a very high level of coverage of the target sequence, but a high level of duplication is more likely to indicate some kind of enrichment bias (e.g. PCR over-amplification).

This module counts the degree of duplication for every sequence in a library and shows the relative number of sequences with different degrees of duplication.

To cut down on the memory requirements for this module only sequences which first appear in the first 100,000 sequences in each file are analysed, but this should be enough to get a good impression for the duplication levels in the whole file. Each sequence is tracked to the end of the file to give a representative count of the overall duplication level. To cut down on the amount of information any sequences with more than 10 duplicates are placed into grouped bins to give a clear impression of the overall duplication level without having to show each individual duplication value.

Because the duplication detection requires an exact sequence match over the whole length of the sequence, any reads over 75bp in length are truncated to 50bp for the purposes of this analysis. Even so, longer reads are more likely to contain sequencing errors which will artificially increase the observed diversity and will tend to underrepresent highly duplicated sequences.

The module also calculates an expected overall loss of sequence were the library to be deduplicated.

  • Total percentages: The distribution of duplication levels across the full sequence set.
  • Deduplicated percentages: The distribution after de-duplication. Each value is the proportion of the deduplicated set that comes from the corresponding duplication level in the original data.
  • Percentage diff: The expected overall effect on the sequence count were the library to be deduplicated. The value in the example below is consistent with the percentage of sequences that would remain.
  • Labels: The set of duplication levels.

Example

"dedup_percentages": [
                88.7270625897991,
                8.234690198626954,
                1.3275601587718149,
                0.4567248391318406,
                0.23253294096467203,
                0.16711084299112353,
                0.10936917951231077,
                0.08366047968337169,
                0.060472876913210516,
                0.4965347445786828,
                0.05984315938476192,
                0.041475456999339945,
                0.0011850130571239984,
                0.0017775195856859977,
                0,
                0
            ],
            "total_percentages": [
                62.5830541835812,
                11.616569913245959,
                2.809157665879907,
                1.2885915309298914,
                0.8200779570004981,
                0.7072232509214026,
                0.5399998558899416,
                0.47207498413461896,
                0.383887339423591,
                7.222186751326263,
                2.92962348084703,
                6.041878268559988,
                0.6720163419689051,
                1.9136584762908062,
                0,
                0
            ],
            "percent_diff_seq": 70.53434697022905,
            "labels": [
                "1",
                "2",
                "3",
                "4",
                "5",
                "6",
                "7",
                "8",
                "9",
                ">10",
                ">50",
                ">100",
                ">500",
                ">1k",
                ">5k",
                ">10k"
            ]
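
The binning into duplication levels can be sketched as below. This is a simplified illustration: it assumes a complete mapping from each distinct sequence to its count, places a level in the last bin whose threshold it meets or exceeds (so a level of exactly 10 lands in ">10"), and applies no correction for sequences that were never tracked. The names `bin_index` and `duplication_levels` are hypothetical. The last value is computed as distinct sequences over total sequences, which matches the example above, where 62.583 / 88.727 is approximately 0.7053.

```python
LABELS = ["1", "2", "3", "4", "5", "6", "7", "8", "9",
          ">10", ">50", ">100", ">500", ">1k", ">5k", ">10k"]
THRESHOLDS = [10, 50, 100, 500, 1000, 5000, 10000]

def bin_index(level):
    """Map a duplication level (1 or more) to its index in LABELS."""
    if level < 10:
        return level - 1
    return 8 + sum(1 for t in THRESHOLDS if level >= t)

def duplication_levels(seq_counts):
    """seq_counts maps each distinct sequence to the number of times it was seen."""
    dedup = [0] * len(LABELS)  # distinct sequences per duplication level
    total = [0] * len(LABELS)  # reads per duplication level
    for n in seq_counts.values():
        i = bin_index(n)
        dedup[i] += 1
        total[i] += n
    distinct = sum(dedup)
    reads = sum(total)
    return {
        "dedup_percentages": [100.0 * d / distinct for d in dedup] if distinct else [],
        "total_percentages": [100.0 * t / reads for t in total] if reads else [],
        # Share of reads that would remain after deduplication.
        "percent_diff_seq": 100.0 * distinct / reads if reads else 0.0,
        "labels": LABELS,
    }

levels = duplication_levels({"a": 1, "b": 1, "c": 2, "d": 12})
```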

Adapter Content

Summary

The Kmer Content module will do a generic analysis of all of the Kmers in your library to find those which do not have even coverage through the length of your reads. This can find a number of different sources of bias in the library which can include the presence of read-through adapter sequences building up on the end of your sequences.

You can however find that the presence of any overrepresented sequences in your library (such as adapter dimers) will cause the Kmer plot to be dominated by the Kmers these sequences contain, and that it's not always easy to see if there are other biases present in which you might be interested.

One obvious class of sequences which you might want to analyse are adapter sequences. It is useful to know if your library contains a significant amount of adapter in order to be able to assess whether you need to adapter trim or not. Although the Kmer analysis can theoretically spot this kind of contamination it isn't always clear. This module therefore does a specific search for a set of separately defined Kmers and will give you a view of the total proportion of your library which contain these Kmers. A results trace will always be generated for all of the sequences present in the adapter config file so you can see the adapter content of your library, even if it's low.

Once a sequence has been seen in a read it is counted as being present right through to the end of the read, so the percentages you see will only increase as the read length goes on.

  • Labels: The set of adapters read from the adapter file.
  • X-labels: The base positions, grouped into bins that each hold either a single position or a range of positions.
  • Enrichments: Listed in the same order as the adapters in Labels. Each entry shows, position by position, the cumulative percentage of your library in which that adapter sequence has been seen.

Example

"adapter_content": {
            "labels": [
                "Illumina Universal Adapter",
                "Illumina Small RNA 3' Adapter",
                "Illumina Small RNA 5' Adapter",
                "Nextera Transposase Sequence",
                "SOLID Small RNA Adapter"
            ],
            "x_labels": [
                "1",
                "2",
                "3",
                "4",
                "5",
                "6",
                "7",
                "8",
                "9",
                "10-11",
                "12-13",
                "14-15",
                "16-17",
                ...
            ],
            "enrichments": [
                [
                    0.002,
                    0.002,
                    0.002,
                    0.002,
                    0.002,
                    0.002,
                    0.002,
                    0.002,
                    0.002,
                    0.0021999999999999997,
                    0.0024,
                    0.0024,
                    0.0024,
                    ...
                ],
                [
                    0,
                    0,
                    0,
                    0,
                    0,
                    0,
                    0,
                    0,
                    0,
                    0,
                    0,
                    0,
                    0,
                    ...
                ],
                [
                    0,
                    0,
                    0,
                    0,
                    0,
                    0,
                    0,
                    0,
                    0,
                    0,
                    0,
                    0,
                    0,
                    ...
                ],
                [
                    0,
                    0,
                    0,
                    0,
                    0,
                    0,
                    0,
                    0,
                    0,
                    0,
                    0,
                    0,
                    0,
                    ...
                ],
                [
                    0,
                    0,
                    0,
                    0,
                    0,
                    0,
                    0,
                    0,
                    0,
                    0,
                    0,
                    0,
                    0,
                    ...
                ]
            ]
        },
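
The cumulative counting can be sketched as follows. The function name `adapter_content` is hypothetical, and the probe sequence used in the usage lines is made up for illustration and is not a real adapter. The sketch reports one value per position and does not group positions into ranges.

```python
from itertools import accumulate

def adapter_content(seqs, adapters):
    """Cumulative percentage of reads in which each adapter probe has been seen.

    `adapters` maps an adapter name to the probe sequence searched for.
    """
    labels = list(adapters)
    max_len = max((len(s) for s in seqs), default=0)
    enrichments = []
    for name in labels:
        probe = adapters[name]
        first_seen = [0] * max_len
        for s in seqs:
            pos = s.find(probe)  # position of the first occurrence, or -1
            if pos >= 0:
                first_seen[pos] += 1
        # Once seen, an adapter counts as present through to the end of the read.
        cumulative = accumulate(first_seen)
        enrichments.append([100.0 * c / len(seqs) for c in cumulative])
    return {"labels": labels, "enrichments": enrichments}

demo_reads = ["AAAGATC", "GATCAAA", "AAAAAAA", "AAAAAAA"]
content = adapter_content(demo_reads, {"Demo adapter": "GATC"})  # hypothetical probe
```

Because each row is a running total, its values can never decrease from one position to the next.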

Kmer Content

Summary

The analysis of overrepresented sequences will spot an increase in any exactly duplicated sequences, but there is a different subset of problems where it will not work.

  • If you have very long sequences with poor sequence quality then random sequencing errors will dramatically reduce the counts for exactly duplicated sequences.
  • If you have a partial sequence which is appearing at a variety of places within your sequence then this won't be seen either by the per base content module or the duplicate sequence analysis.

The Kmer module starts from the assumption that any small fragment of sequence should not have a positional bias in its appearance within a diverse library. There may be biological reasons why certain Kmers are enriched or depleted overall, but these biases should affect all positions within a sequence equally. This module therefore measures the number of each 7-mer at each position in your library and then uses a binomial test to look for significant deviations from an even coverage at all positions. Any Kmers with positionally biased enrichment are reported. The top 6 most biased Kmers are additionally reported to show their distribution.

To allow this module to run in a reasonable time only 2% of the whole library is analysed and the results are extrapolated to the rest of the library. Sequences longer than 500bp are truncated to 500bp for this analysis.

  • Enriched Kmers: The full set of Kmers to be reported.
  • Enrichments: For each Kmer in x-labels, the deviation from even coverage at each position.
  • X-categories: The base positions, grouped into bins that each hold either a single position or a range of positions.
  • X-labels: The top 6 most biased Kmers, which are additionally plotted to show their distribution.
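
The positional test described in the summary can be sketched as below. This is an illustration under stated assumptions, not the module's implementation: enrichment is taken as observed count over expected count at each position, the p-value is a one-sided binomial upper tail, and no sampling or position grouping is applied. All function names are hypothetical.

```python
import math
from collections import defaultdict

def kmer_position_counts(seqs, k=7):
    """Count each k-mer at each 0-based start position."""
    counts = defaultdict(lambda: defaultdict(int))  # kmer -> position -> count
    windows = defaultdict(int)                      # position -> k-mers seen there
    for s in seqs:
        for i in range(len(s) - k + 1):
            kmer = s[i:i + k]
            if "N" in kmer:
                continue  # skip windows containing uncalled bases
            counts[kmer][i] += 1
            windows[i] += 1
    return counts, windows

def binom_upper_tail(n, p, x):
    """P(X >= x) for X ~ Binomial(n, p), summed in log space."""
    if x <= 0 or p >= 1.0:
        return 1.0
    if p <= 0.0:
        return 0.0
    log_p, log_q = math.log(p), math.log(1.0 - p)
    tail = 0.0
    for i in range(x, n + 1):
        log_pmf = (math.lgamma(n + 1) - math.lgamma(i + 1) - math.lgamma(n - i + 1)
                   + i * log_p + (n - i) * log_q)
        tail += math.exp(log_pmf)
    return min(tail, 1.0)

def kmer_enrichment(counts, windows, kmer):
    """Return (position, observed/expected, p_value) with 1-based positions."""
    per_pos = counts.get(kmer, {})
    total_k = sum(per_pos.values())
    total_w = sum(windows.values())
    if total_k == 0 or total_w == 0:
        return []
    p = total_k / total_w  # chance that any given window holds this k-mer
    rows = []
    for pos in sorted(windows):
        n = windows[pos]
        observed = per_pos.get(pos, 0)
        rows.append((pos + 1, observed / (n * p), binom_upper_tail(n, p, observed)))
    return rows

counts, windows = kmer_position_counts(["ACTT", "ACGG", "ACTG", "TTAC"], k=2)
rows = kmer_enrichment(counts, windows, "AC")
```

In this toy input "AC" starts three of the four reads, so its enrichment at position 1 is 2.25 times the expected value.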

Example

        "kmer_content": {
            "enriched_kmers": [
                {
                    "sequence": "CGCATTT",
                    "count": 7,
                    "lowest_pvalue": 0.0036627573317673523
                },
                {
                    "sequence": "CTCGCTA",
                    "count": 56,
                    "lowest_pvalue": 0
                },
                {
                    "sequence": "CCCCTAT",
                    "count": 14,
                    "lowest_pvalue": 0.0010821496325661428
                },
                ...
            ],
            "enrichments": [
                [
                    0,
                    0,
                    61.712828571428574,
                    0,
                    0,
                    0,
                    0,
                    ...
                ],
                [
                    43.792079314194126,
                    23.179397750686814,
                    10.28547142857143,
                    7.714103571428572,
                    7.714103571428572,
                    7.714103571428572,
                    ...
                ],
                [
                    41.2160746486533,
                    10.301954555860807,
                    0,
                    0,
                    0,
                    0,
                    ...
                ],
                [
                    0,
                    0,
                    20.57094285714286,
                    0,
                    30.856414285714287,
                    41.14188571428572,
                    ...
                ],
                [
                    18.032032658785813,
                    40.56394606370192,
                    20.249521875,
                    8.9997875,
                    6.749840625,
                    6.749840625,
                    ...
                ],
                [
                    6.272011359577675,
                    16.72201319212189,
                    39.65123768115942,
                    18.782165217391306,
                    8.347628985507248,
                    6.260721739130435,
                    ...
                ]
            ],
            "x_categories": [
                "1",
                "2",
                "3",
                "4",
                "5",
                "6",
                "7",
                "8",
                "9",
                "10-14",
                "15-19",
                "20-24",
                ...
            ],
            "x_labels": [
                "CGCATTT",
                "CTCGCTA",
                "CCCCTAT",
                "CGATTTG",
                "TCGCTAT",
                "CGCTATG"
            ]
        },
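
A minimal sketch of how this block might be consumed, assuming that each row of enrichments corresponds to the Kmer at the same index in x_labels and each column to the entry at the same index in x_categories. That alignment is inferred from the field descriptions rather than documented, and peak_enrichment is a hypothetical helper. The rows below hold only the leading values shown in the example, so the peaks apply to this excerpt only.

```python
# Excerpt of the example above: first three Kmers, leading values only.
kmer_content = {
    "x_labels": ["CGCATTT", "CTCGCTA", "CCCCTAT"],
    "x_categories": ["1", "2", "3", "4", "5", "6", "7"],
    "enrichments": [
        [0, 0, 61.712828571428574, 0, 0, 0, 0],
        [43.792079314194126, 23.179397750686814, 10.28547142857143,
         7.714103571428572, 7.714103571428572, 7.714103571428572],
        [41.2160746486533, 10.301954555860807, 0, 0, 0, 0],
    ],
}


def peak_enrichment(content):
    """Map each plotted Kmer to (position group, value) of its highest enrichment."""
    peaks = {}
    for kmer, row in zip(content["x_labels"], content["enrichments"]):
        best = max(range(len(row)), key=row.__getitem__)  # index of the largest value
        peaks[kmer] = (content["x_categories"][best], row[best])
    return peaks


peaks = peak_enrichment(kmer_content)
```

In this excerpt CGCATTT peaks in position group "3", while the other two Kmers peak in group "1".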

Per Tile Sequence Quality

Summary

This module will only appear in your analysis results if you're using an Illumina library which retains its original sequence identifiers. Encoded in these is the flowcell tile from which each read came. The module allows you to look at the quality scores from each tile across all of your bases to see if there was a loss in quality associated with only one part of the flowcell.

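  • X-labels: The position groups along the read; each group is either a single position or a range of positions.
  • Tiles: The IDs of all tiles found in the file.
  • Means: The average quality values for each tile, given as one list per tile. The values in the example below are centred on zero, which suggests they are reported relative to the average over all tiles rather than as raw quality scores.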
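
The tile mentioned in the summary can be recovered from the colon-separated fields of an Illumina read identifier. The sketch below assumes the two common layouts (tile as the fifth field in identifiers with seven or more fields, and as the third field in the older five-field style); tile_from_read_id is a hypothetical helper, not part of this tool.

```python
def tile_from_read_id(read_id):
    """Return the tile number encoded in an Illumina read identifier, or None."""
    # Drop the leading '@' and anything after the first space, then split on ':'.
    fields = read_id.lstrip("@").split(" ")[0].split(":")
    if len(fields) >= 7:      # instrument:run:flowcell:lane:tile:x:y
        tile_field = fields[4]
    elif len(fields) >= 5:    # older style, instrument:lane:tile:x:y
        tile_field = fields[2]
    else:
        return None           # identifier does not carry tile information
    return int(tile_field) if tile_field.isdigit() else None
```

Reads whose identifiers have been rewritten, for example by a public archive, return None here, which matches the note above that the module only appears when the original identifiers are retained.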

Example

        "per_tile_quality_score": {
            "x_labels": [
                "1",
                "2",
                "3",
                "4",
                "5",
                "6",
                "7",
                "8",
                ...
            ],
            "tiles": [
                1101,
                1102,
                1103,
                1104,
                1105,
                1106,
                1107,
                1108,
                1109,
                1110,
                1111,
                1112
            ],
            "means": [
                [
                    -0.006346050119049096,
                    -0.0654172611747228,
                    0.04078626086620574,
                    0.018562823162376674,
                    -0.07341049875898875,
                    -0.016429787792191064,
                    -0.01372768334803709,
                    -0.03346466757579947,
                    ...
                ],
                [
                    -0.0936709871750665,
                    -0.0331261565078762,
                    0.06435934332282045,
                    -0.037159459006495865,
                    0.03486157659394706,
                    0.006024915022308619,
                    0.01689568685248588,
                    -0.008014505719849296,
                    ...
                ],
                [
                    0.19160803277671334,
                    0.0720129699881511,
                    0.046999435285798086,
                    -0.016470494564551075,
                    0.08760990913324918,
                    -0.056012653372455645,
                    0.05859393421414438,
                    -0.0757104323515776,
                    ...
                ],
                [
                    -0.08292170351253958,
                    -0.04457089187183527,
                    -0.02132341694053963,
                    0.09498880685247002,
                    -0.02909020660642625,
                    -0.08919508767951356,
                    0.06544754755240234,
                    0.04402972234209557,
                    ...
                ],
                [
                    -0.05259925769379237,
                    0.17734065966617152,
                    0.05323547448775656,
                    -0.031241010441334538,
                    0.06060992018126399,
                    0.057363400444444324,
                    0.049313153043023306,
                    0.036693847332685436,
                    ...
                ],
                [
                    0.04623782481657912,
                    -0.1742257660234543,
                    0.008555611227521354,
                    0.0780235468237791,
                    0.05726359559632499,
                    0.07284318681385571,
                    -0.4294163047979822,
                    0.03223193133478475,
                    ...
                ],
                [
                    0.006707578514678403,
                    0.1224868275087374,
                    0.0819144995306047,
                    -0.0927164958223301,
                    0.07133213032459196,
                    -0.021621178716586087,
                    -0.02652805075825171,
                    0.024320665561241128,
                    ...
                ],
                [
                    -0.1892194282145212,
                    0.05937301539450601,
                    -0.009251856831049565,
                    0.00966011187865945,
                    -0.0506608728653859,
                    -0.000005505544429240672,
                    0.0768259129498503,
                    -0.021634040275564814,
                    ...
                ],
                [
                    -0.01482087894996198,
                    -0.04388643789315694,
                    -0.15656598135250732,
                    -0.02290621225324685,
                    -0.02460270912033735,
                    0.05281604056343525,
                    -0.027855395398596272,
                    -0.02307402952937565,
                    ...
                ],
                [
                    0.0997225814284306,
                    0.023947398136456854,
                    0.00972046412779548,
                    0.025144240173403887,
                    -0.015702242983785197,
                    -0.03782832466826136,
                    0.21077909870614775,
                    0.006800753397236292,
                    ...
                ],
                [
                    0.033955607721260606,
                    -0.09951870036752553,
                    -0.08152383052087941,
                    -0.07023362168592229,
                    -0.06128076790366066,
                    -0.03250179949124998,
                    0.07098260870256468,
                    0.001405350739482003,
                    ...
                ],
                [
                    0.06134668040730418,
                    0.005584343144498405,
                    -0.03690600320344117,
                    0.0443477648832129,
                    -0.05692983359072201,
                    0.06454679442065725,
                    -0.05131050771772294,
                    0.016415404744613227,
                    ...
                ]
            ]
        }
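
A minimal sketch of how these values might be scanned for a localised quality drop, assuming each row of means belongs to the tile at the same index in tiles and each column to the same index in x_labels. That alignment, the flag_low_cells helper and the -0.2 threshold are assumptions chosen for illustration. The data is an excerpt of three tiles from the example above.

```python
# Excerpt of the example above: tiles 1105 to 1107, first eight positions.
per_tile = {
    "x_labels": ["1", "2", "3", "4", "5", "6", "7", "8"],
    "tiles": [1105, 1106, 1107],
    "means": [
        [-0.05259925769379237, 0.17734065966617152, 0.05323547448775656,
         -0.031241010441334538, 0.06060992018126399, 0.057363400444444324,
         0.049313153043023306, 0.036693847332685436],
        [0.04623782481657912, -0.1742257660234543, 0.008555611227521354,
         0.0780235468237791, 0.05726359559632499, 0.07284318681385571,
         -0.4294163047979822, 0.03223193133478475],
        [0.006707578514678403, 0.1224868275087374, 0.0819144995306047,
         -0.0927164958223301, 0.07133213032459196, -0.021621178716586087,
         -0.02652805075825171, 0.024320665561241128],
    ],
}


def flag_low_cells(data, threshold=-0.2):
    """List (tile, position label, value) for every value below the threshold."""
    flagged = []
    for tile, row in zip(data["tiles"], data["means"]):
        for label, value in zip(data["x_labels"], row):
            if value < threshold:
                flagged.append((tile, label, value))
    return flagged


flagged = flag_low_cells(per_tile)
```

With this excerpt and threshold, the only flagged cell is tile 1106 at position "7".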

Mislabeling