GC box gene transcriptions

Editor-In-Chief: Henry A. Hoff

A GC box is also known as a GSG box.^[1]

"One response element that was highly enriched in these antioxidant gene promoters was the GC box (5′-(G/T)GGGCGG(G/A)(G/A)(C/T)-3′, the binding site for the transcription factor Sp1 (specificity protein 1) (Ryu et al., 2003a, Ryu et al., 2003b)."^[2]

Boxes

A "repeating sequence of nucleotides that forms a transcription or a regulatory signal"^[3] is a box.

GC box theory

Def. "[a] sequence of contiguous guanine, guanine, guanine, cytosine, and guanine, in that order, along a DNA strand"^[4] is called a GC box.

GC elements

The GC elements are bound by transcription factors and have similar functions to enhancers.^[5]

Alu repeats

Karyotype from a female human lymphocyte (46, XX). Chromosomes were hybridized with a probe for Alu elements (green) and counterstained with TOPRO-3 (red). Alu elements were used as a marker for chromosomes and chromosome bands rich in genes. Credit: Andreas Bolzer, Gregor Kreth, Irina Solovei, Daniela Koehler, Kaan Saracoglu, Christine Fauth, Stefan Müller, Roland Eils, Christoph Cremer, Michael R. Speicher, Thomas Cremer.

"GC-rich genomic sequences [include those] such as Alu repeats."^[6]

"An Alu element is a short stretch [2-8 nucleotides] of DNA originally characterized by the action of the Alu (Arthrobacter luteus) restriction endonuclease.^[7] Alu elements of different kinds occur in large numbers in primate genomes. In fact, Alu elements are the most abundant transposable elements in the human genome."^[8]

"The Alu family is a family of repetitive elements in the human genome. Modern Alu elements are about 300 base pairs long and are therefore classified as short interspersed elements (SINEs) among the class of repetitive DNA elements. The typical structure is 5'Part A- A5TACA6 -Part B - PolyA Tail - 3', where Part A and Part B are similar peptide sequences, but of opposite direction."^[8]

There are over one million Alu elements interspersed throughout the human genome, and it is estimated that about 10.7% of the human genome consists of Alu sequences. However, fewer than 0.5% are polymorphic.^[9]

Alu elements are retrotransposons and look like DNA copies made from RNA polymerase III-encoded RNAs. Alu elements do not encode protein products and depend on LINE retrotransposons for their replication.^[10]

"Alu elements in primates form a fossil record that is relatively easy to decipher because Alu elements insertion events have a characteristic signature that is both easy to read and faithfully recorded in the genome from generation to generation. The study of Alu elements thus reveals details of ancestry because individuals will only share a particular Alu element insertion if they have a common ancestor."^[8]

Most human Alu element insertions can be found in the corresponding positions in the genomes of other primates, but about 7,000 Alu insertions are unique to humans.^[11]

"Full-length Alu elements are ~300 bp long and are commonly found in introns, 3′ untranslated regions of genes and intergenic genomic regions".^[12] Human subfamilies include Y, Yc1, Yc2, Ya5, Ya5a2, Yb8, and Yb9.^[12] A source of simple sequence repeats is an A-rich region "that contains the sequence A₅TACA₆".^[12]

"[T]here are ~24 CpG positions in a new Alu insertion ... the decay of methylated CpG dinucleotides into TpG dinucleotides would also tend to increase the pair-wise divergence between Alu repeats over time, thereby decreasing the recombination between elements."^[12]

CpG sites

"CpG sites or CG sites are regions of DNA where a cytosine nucleotide occurs next to a guanine nucleotide in the linear sequence of bases along its length. "CpG" is shorthand for "—C—phosphate—G—", that is, cytosine and guanine separated by only one phosphate; phosphate links any two nucleosides together in DNA. The "CpG" notation is used to distinguish this linear sequence from the CG base-pairing of cytosine and guanine. The CpG notation can also be interpreted as the cytosine being 5 prime to the guanine base."^[13] "The "p" in CpG refers to the phosphodiester bond between the cytosine and the guanine, which indicates that the C and the G are next to each other in sequence, regardless of being single- or double- stranded. In a CpG site, both C and G are found on the same strand of DNA or RNA and are connected by a phosphodiester bond. This is a covalent bond between atoms, stable and permanent as opposed to the three hydrogen bonds established after base-pairing of C and G in opposite strands of DNA."^[6]

CpG islands

There are regions of the genome that have a higher concentration of CpG sites, known as CpG islands. Many genes in mammalian genomes have CpG islands associated with the start of the gene^[14] (promoter regions). Because of this, the presence of a CpG island is used to help in the prediction and annotation of genes.

"The usual formal definition of a CpG island is a region with at least 200 [base pair] bp, and a GC percentage that is greater than 50%, and with an observed-to-expected CpG ratio that is greater than 60%. The "observed-to-expected CpG ratio" is calculated by formula ((Num of CpG/(Num of C × Num of G)) × Total number of nucleotides in the sequence)."^[15]

In mammalian genomes, CpG islands are typically 300-3,000 base pairs in length, and have been found in or near approximately 40% of promoters of mammalian genes.^[16] About 70% of human promoters have a high CpG content. Given the frequency of GC two-nucleotide sequences, the number of CpG dinucleotides is much lower than would be expected.^[17]
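The formal CpG island definition quoted above can be checked mechanically. The following is a minimal Python sketch, assuming a single strand given as a plain string of A, C, G and T; the function names `cpg_island_stats` and `looks_like_cpg_island` are illustrative and do not come from any cited source.

```python
def cpg_island_stats(seq: str) -> dict:
    """Return length, GC fraction and observed/expected CpG ratio for one strand."""
    seq = seq.upper()
    n = len(seq)
    c = seq.count("C")
    g = seq.count("G")
    # "CG" cannot overlap with itself, so str.count finds every CpG dinucleotide.
    cpg = seq.count("CG")
    gc_fraction = (c + g) / n if n else 0.0
    # Quoted formula: (Num of CpG / (Num of C x Num of G)) x total nucleotides.
    obs_exp = (cpg * n) / (c * g) if c and g else 0.0
    return {"length": n, "gc_fraction": gc_fraction, "obs_exp": obs_exp}


def looks_like_cpg_island(seq: str, min_len: int = 200,
                          min_gc: float = 0.5, min_obs_exp: float = 0.6) -> bool:
    """Apply the three quoted thresholds: length, GC percentage, CpG ratio."""
    s = cpg_island_stats(seq)
    return (s["length"] >= min_len
            and s["gc_fraction"] > min_gc
            and s["obs_exp"] > min_obs_exp)


print(looks_like_cpg_island("CG" * 100))  # extreme toy case: 200 bp of pure CpG
print(looks_like_cpg_island("AT" * 100))  # no C or G at all
```

Real island finders scan a sliding window along a chromosome; this sketch only scores one candidate region at a time.
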

"CpG islands are characterized by CpG dinucleotide content of at least 60% of that which would be statistically expected (~4–6%), whereas the rest of the genome has much lower CpG frequency (~1%), a phenomenon called CG suppression. Unlike CpG sites in the coding region of a gene, in most instances the CpG sites in the CpG islands of promoters are unmethylated if the genes are expressed."^[6]

CT elements

"The special protein (Sp) family consists of four C2H2 zinc finger DNA-binding proteins [specificity protein 1 (Sp1)-4] that mainly regulate the expression of many genes by binding to GC box motifs (GGGCGG), GT motifs (GGGTGTGGC) and CT elements (CCTCCTCCTCCTCGGCCTCCTCCCC) in the promoter region of the target genes [21]."^[18]

Methylation

"Cytosines in CpG dinucleotides can be methylated to form 5-methylcytosine. In mammals, methylating the cytosine within a gene can turn the gene off, a mechanism that is part of a larger field of science studying gene regulation that is called epigenetics. Enzymes that add a methyl group are called DNA methyltransferases."^[13]

In mammals, 70% to 80% of CpG cytosines are methylated.^[19]

"CpG dinucleotides have long been observed to occur with a much lower frequency in the sequence of vertebrate genomes than would be expected due to random chance. For example, in the human genome, which has a 42% GC content, a pair of nucleotides consisting of cytosine followed by guanine would be expected to occur 0.21 * 0.21 = 4.41% of the time. The frequency of CpG dinucleotides in human genomes is 1% — less than one-quarter of the expected frequency."^[13]
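The arithmetic in the quotation can be reproduced in a few lines. This is a sketch under the same simplifying assumptions the quotation makes: C and G are equally frequent (each half of the GC content) and neighbouring bases are independent. The function name is illustrative.

```python
def expected_cpg_frequency(gc_content: float) -> float:
    """Expected CpG dinucleotide frequency if bases were independent."""
    p_c = gc_content / 2  # assumption: C makes up half of the GC content
    p_g = gc_content / 2  # assumption: G makes up the other half
    return p_c * p_g


expected = expected_cpg_frequency(0.42)  # 0.21 * 0.21
observed = 0.01                          # about 1%, as quoted
print(f"expected: {expected:.4f}")       # expected: 0.0441
print(f"observed / expected: {observed / expected:.2f}")  # observed / expected: 0.23
```

The ratio of about 0.23 is the "less than one-quarter" stated in the quotation.
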

Unmethylated CpG sites can be detected by Toll-Like Receptor 9^[20] "(TLR 9) on plasmacytoid dendritic cells and B cells in humans. This is used to detect intracellular viral, fungal, and bacterial pathogen DNA."^[13]

Methylation is central to imprinting, along with histone modifications.^[21] Most of the methylation occurs a short distance from the CpG islands (at "CpG island shores") rather than in the islands themselves.^[22]

Methylation of CpG sites within the promoters of genes can lead to their silencing, a feature found in a number of human cancers (for example the silencing of tumor suppressor genes). In contrast, the hypomethylation of CpG sites has been associated with the over-expression of oncogenes within cancer cells.^[23]

Deamination

The CpG deficiency is due to an increased vulnerability of methylcytosines to spontaneously deaminate to thymine in genomes with CpG cytosine methylation.^[24]

Mutations

Alu elements are a common source of mutation in humans, but such mutations are often confined to non-coding regions where they have little discernible impact on the bearer.^[25]

The mutagenic effect of Alu^[26] and retrotransposons in general^[27] "has played a major role in the recent evolution of the human genome."^[8]

The first report of Alu-mediated recombination causing a prevalent inherited predisposition to cancer was a 1995 report about hereditary nonpolyposis colorectal cancer.^[28]

"The human diseases caused by Alu insertions include":^[12]

The following diseases have been associated with single-nucleotide DNA variations in Alu elements impacting transcription levels:^[29]

The ACE gene, encoding angiotensin-converting enzyme, has two common variants, one with an Alu insertion (ACE-I) and one with the Alu deleted (ACE-D). This variation has been linked to changes in sporting ability: the presence of the Alu element is associated with better performance in endurance-oriented events (e.g. triathlons), whereas its absence is associated with strength- and power-oriented performance.^[30]

The opsin gene duplication which resulted in the re-gaining of trichromacy in Old World primates (including humans) is flanked by an Alu element,^[31] "implicating the role of Alu in the evolution of three colour vision."^[8]

Consensus sequences

"A GC box sequence, one of the most common regulatory DNA elements of eukaryotic genes, is recognized by the Sp1 transcription factor; its consensus sequence is represented as 5'-G/T G/A GGCG G/T G/A G/A C/T-3' [or 5′-KRGGCGKRRY-3′] (Briggs et al., 1986)."^[32]
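The degenerate letters in 5′-KRGGCGKRRY-3′ are standard IUPAC nucleotide codes (K = G or T, R = A or G, Y = C or T). As a minimal sketch of how such a consensus can be turned into a testable pattern, the Python below expands each code into a regular-expression character class; the names `IUPAC`, `iupac_to_regex` and `GC_BOX` are illustrative only.

```python
import re

# IUPAC nucleotide codes needed for this consensus (M is included for its complement).
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "K": "[GT]", "M": "[AC]", "R": "[AG]", "Y": "[CT]",
}


def iupac_to_regex(consensus: str) -> str:
    """Expand a degenerate consensus into a regular-expression string."""
    return "".join(IUPAC[base] for base in consensus.upper())


GC_BOX = re.compile(iupac_to_regex("KRGGCGKRRY"))

print(GC_BOX.pattern)                         # [GT][AG]GGCG[GT][AG][AG][CT]
print(bool(GC_BOX.fullmatch("TGGGCGTGGT")))   # True: fits every position
print(bool(GC_BOX.fullmatch("TGGGCGTGGA")))   # False: last base is not C or T
```
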

Transcription start sites

"In promoters containing multiple GC boxes but lacking the TATAA box, transcription start sites may be single and specific, as observed in the nerve growth factor receptor gene (42) and the cellular retinol-binding protein gene (37), or there may be multiple heterogeneous start sites, such as those found in the c-myb (4), insulin receptor (45), and Ha-ras (21) genes. ... GC boxes are responsible for directing transcription from the major and the minor start sites. ... All TATAA-less promoters have at least two GC boxes".^[33]

"CpG islands typically occur at or near the transcription start site of genes, particularly housekeeping genes, in vertebrates.^[17] Normally a C (cytosine) base followed immediately by a G (guanine) base (a CpG) is rare in vertebrate DNA because the cytosines in such an arrangement tend to be methylated. This methylation helps distinguish the newly synthesized DNA strand from the parent strand, which aids in the final stages of DNA proofreading after duplication. However, over evolutionary time methylated cytosines tend to turn into thymines because of spontaneous deamination. While there is a special enzyme in human (Thymine-DNA glycosylase, or TDG) that specifically replaces T's from T/G mismatches, it is not sufficiently effective to prevent the relatively rapid mutation of the dinucleotides. The result is that CpGs are relatively rare. The existence of CpG islands is usually explained by the existence of selective forces for relatively high CpG content, or low levels of methylation in that genomic area, perhaps having to do with the regulation of gene expression. Recently a study showed that most CpG islands are a result of non-selective forces. ^[34]"^[6]

Transcription factors

"[A] GC box-binding factor is required for transcription and ... a truncated promoter containing one GC box is transcriptionally inactive (44). ... the DNA-protein interactions occurring at the GC boxes in the DHFR promoter are functionally distinct and that factors binding to the GC boxes must interact in a position-dependent manner."^[33]

Human genes

"A large subclass of polymerase II promoters lacks both TATAA and CCAAT sequence motifs but contains multiple GC boxes. This promoter class includes several housekeeping genes (e.g., the genes encoding dihydrofolate reductase [DHFR] ..., hydroxymethylglutaryl coenzyme A reductase [39], hypoxanthine guanine phosphoribosyltransferase [33], and adenosine deaminase [46]) [and] nonhousekeeping genes (e.g., the transforming growth factor alpha [9, 23], rat malic enzyme [36], human c-Ha-ras [21], epidermal growth factor receptor [22], and nerve growth factor receptor [42] genes)."^[33]

Some 12,000 Sp1 binding sites are found in the human genome.^[35]

Hypotheses

The GC box does not indicate the TSS for A1BG.
A1BG has no GC boxes in either promoter.
A1BG is not transcribed by a GC box.
A GC box does not participate in the transcription of A1BG.

GC box (Briggs) samplings

"A GC box [...] consensus sequence is represented as 5'-(G/T)(G/A)GGCG(G/T)(G/A)(G/A)(C/T)-3' [or 5′-KRGGCGKRRY-3′] (Briggs et al., 1986)."^[32]

Basic programs (starting with SuccessablesGC.bas) were written to compare nucleotide sequences with the sequences on either the template strand (-) or the coding strand (+) of the DNA, in the negative direction (-) or the positive direction (+). For each strand and direction, the list gives the program, the sequence it looked for (out to 4445), the number of matches found, and each match with its position:

negative strand in the negative direction is SuccessablesGC--.bas, looking for (G/T)(G/A)GGCG(G/T)(G/A)(G/A)(C/T), 0.
positive strand in the negative direction is SuccessablesGC+-.bas, looking for (G/T)(G/A)GGCG(G/T)(G/A)(G/A)(C/T), 2, TGGGCGTGGT at 3048, TGGGCGTGGT at 1898.
negative strand in the positive direction is SuccessablesGC-+.bas, looking for (G/T)(G/A)GGCG(G/T)(G/A)(G/A)(C/T), 1, TGGGCGGGAC at 409.
positive strand in the positive direction is SuccessablesGC++.bas, looking for (G/T)(G/A)GGCG(G/T)(G/A)(G/A)(C/T), 0.
inverse complement, negative strand, negative direction is SuccessablesGCci--.bas, looking for (A/G)(C/T)(C/T)(A/C)CGCC(C/T)(A/C), 1, ACTCCGCCCA at 3092.
inverse complement, positive strand, negative direction is SuccessablesGCci+-.bas, looking for (A/G)(C/T)(C/T)(A/C)CGCC(C/T)(A/C), 1, GCTCCGCCTC at 1505.
inverse complement, negative strand, positive direction is SuccessablesGCci-+.bas, looking for (A/G)(C/T)(C/T)(A/C)CGCC(C/T)(A/C), 0.
inverse complement, positive strand, positive direction is SuccessablesGCci++.bas, looking for (A/G)(C/T)(C/T)(A/C)CGCC(C/T)(A/C), 1, GCCACGCCCC at 491.
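
The Basic programs themselves are not reproduced here. As a rough illustration of the kind of search the list describes, the Python sketch below derives the inverse complement of the degenerate consensus and scans a sequence for both patterns. It is not the author's code: the function names, the toy sequence, and the choice to report the 1-based start of each match are all assumptions.

```python
import re

# Complements of the IUPAC codes used here (K = G/T pairs with M = A/C, R with Y).
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C",
              "K": "M", "M": "K", "R": "Y", "Y": "R"}
CLASSES = {"A": "A", "C": "C", "G": "G", "T": "T",
           "K": "[GT]", "M": "[AC]", "R": "[AG]", "Y": "[CT]"}


def inverse_complement(pattern: str) -> str:
    """Reverse the pattern and complement every (possibly degenerate) base."""
    return "".join(COMPLEMENT[b] for b in reversed(pattern.upper()))


def find_sites(seq: str, pattern: str) -> list:
    """Return (1-based start, matched text) for every match, overlaps included."""
    body = "".join(CLASSES[b] for b in pattern.upper())
    # The lookahead lets matches that overlap each other all be reported.
    regex = re.compile("(?=(" + body + "))")
    return [(m.start() + 1, m.group(1)) for m in regex.finditer(seq.upper())]


briggs = "KRGGCGKRRY"
briggs_ci = inverse_complement(briggs)
print(briggs_ci)  # RYYMCGCCYM, i.e. (A/G)(C/T)(C/T)(A/C)CGCC(C/T)(A/C)

# Toy sequence built from two matches reported in the list above.
toy = "AAA" + "TGGGCGTGGT" + "AA" + "ACTCCGCCCA" + "AA"
print(find_sites(toy, briggs))     # [(4, 'TGGGCGTGGT')]
print(find_sites(toy, briggs_ci))  # [(16, 'ACTCCGCCCA')]
```

Searching one strand for both the pattern and its inverse complement is equivalent to searching both strands for the pattern itself, which is why each strand and direction appears twice in the list.
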

GC (4560-2846) UTRs

Negative strand, negative direction: ACTCCGCCCA at 3092.
Positive strand, negative direction: TGGGCGTGGT at 3048.

GC negative direction (2596-1) distal promoters

Positive strand, negative direction: TGGGCGTGGT at 1898.
Positive strand, negative direction: GCTCCGCCTC at 1505.

GC positive direction (4050-1) distal promoters

Negative strand, positive direction: TGGGCGGGAC at 409.
Positive strand, positive direction: GCCACGCCCC at 491.
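
Sorting each match into a UTR or a core, proximal or distal promoter depends only on its position and direction. The sketch below uses the nucleotide ranges given in the section headings of this page. Which region a position exactly on a boundary belongs to is not stated, so assigning it to the region whose upper limit it equals is an assumption, as are the names `REGIONS` and `classify`.

```python
# (name, lower bound exclusive, upper bound inclusive), per direction.
# Ranges are copied from the headings, e.g. "(4560-2846) UTRs".
REGIONS = {
    "negative": [
        ("UTR", 2846, 4560),
        ("core promoter", 2811, 2846),
        ("proximal promoter", 2596, 2811),
        ("distal promoter", 0, 2596),
    ],
    "positive": [
        ("core promoter", 4265, 4445),
        ("proximal promoter", 4050, 4265),
        ("distal promoter", 0, 4050),
    ],
}


def classify(position: int, direction: str) -> str:
    """Name the region containing a position for the given direction."""
    for name, low, high in REGIONS[direction]:
        if low < position <= high:
            return name
    return "outside"


print(classify(3092, "negative"))  # UTR
print(classify(1898, "negative"))  # distal promoter
print(classify(409, "positive"))   # distal promoter
```
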

GC box (Briggs) random dataset samplings

GCBboxr0: 0.
GCBboxr1: 0.
GCBboxr2: 0.
GCBboxr3: 0.
GCBboxr4: 0.
GCBboxr5: 1, GGGGCGTAAT at 645.
GCBboxr6: 0.
GCBboxr7: 3, TGGGCGGAGC at 1223, GAGGCGGGGC at 1067, TAGGCGGGGT at 1047.
GCBboxr8: 1, GGGGCGGGGT at 244.
GCBboxr9: 0.
GCBboxr0ci: 0.
GCBboxr1ci: 1, GTTCCGCCTC at 1647.
GCBboxr2ci: 1, GTTCCGCCCC at 641.
GCBboxr3ci: 1, GCTCCGCCTC at 3378.
GCBboxr4ci: 1, ACTACGCCTC at 3085.
GCBboxr5ci: 0.
GCBboxr6ci: 0.
GCBboxr7ci: 0.
GCBboxr8ci: 0.
GCBboxr9ci: 1, GCCCCGCCTC at 2763.

GCBboxr arbitrary (evens) (4560-2846) UTRs

GCBboxr4ci: ACTACGCCTC at 3085.

GCBboxr alternate (odds) (4560-2846) UTRs

GCBboxr3ci: GCTCCGCCTC at 3378.

GCBboxr alternate negative direction (odds) (2811-2596) proximal promoters

GCBboxr9ci: GCCCCGCCTC at 2763.

GCBboxr arbitrary negative direction (evens) (2596-1) distal promoters

GCBboxr8: GGGGCGGGGT at 244.
GCBboxr2ci: GTTCCGCCCC at 641.

GCBboxr alternate negative direction (odds) (2596-1) distal promoters

GCBboxr5: GGGGCGTAAT at 645.
GCBboxr7: TGGGCGGAGC at 1223, GAGGCGGGGC at 1067, TAGGCGGGGT at 1047.
GCBboxr1ci: GTTCCGCCTC at 1647.

GCBboxr arbitrary positive direction (odds) (4050-1) distal promoters

GCBboxr5: GGGGCGTAAT at 645.
GCBboxr7: TGGGCGGAGC at 1223, GAGGCGGGGC at 1067, TAGGCGGGGT at 1047.
GCBboxr1ci: GTTCCGCCTC at 1647.
GCBboxr3ci: GCTCCGCCTC at 3378.
GCBboxr9ci: GCCCCGCCTC at 2763.

GCBboxr alternate positive direction (evens) (4050-1) distal promoters

GCBboxr8: GGGGCGGGGT at 244.
GCBboxr2ci: GTTCCGCCCC at 641.
GCBboxr4ci: ACTACGCCTC at 3085.

GC box (Briggs) analysis and results

"A GC box sequence, one of the most common regulatory DNA elements of eukaryotic genes, is recognized by the Sp1 transcription factor; its consensus sequence is represented as 5'-G/T G/A GGCG G/T G/A G/A C/T-3' [or 5′-KRGGCGKRRY-3′] (Briggs et al., 1986)."^[32]

Reals or randoms	Promoters	Direction	Numbers	Strands	Occurrences	Averages (± 0.1)
Reals	UTR	negative	2	2	1	1
Randoms	UTR	arbitrary negative	1	10	0.1	0.1
Randoms	UTR	alternate negative	1	10	0.1	0.1
Reals	Core	negative	0	2	0	0
Randoms	Core	arbitrary negative	0	10	0	0
Randoms	Core	alternate negative	0	10	0	0
Reals	Core	positive	0	2	0	0
Randoms	Core	arbitrary positive	0	10	0	0
Randoms	Core	alternate positive	0	10	0	0
Reals	Proximal	negative	0	2	0	0
Randoms	Proximal	arbitrary negative	0	10	0	0.05
Randoms	Proximal	alternate negative	1	10	0.1	0.05
Reals	Proximal	positive	0	2	0	0
Randoms	Proximal	arbitrary positive	0	10	0	0
Randoms	Proximal	alternate positive	0	10	0	0
Reals	Distal	negative	2	2	1	1
Randoms	Distal	arbitrary negative	2	10	0.2	0.35
Randoms	Distal	alternate negative	5	10	0.5	0.35
Reals	Distal	positive	2	2	1	1
Randoms	Distal	arbitrary positive	7	10	0.7	0.5
Randoms	Distal	alternate positive	3	10	0.3	0.5

Comparison:

The occurrences of real GC boxes (Briggs) in the UTRs and distal promoters are greater than those in the random datasets. This suggests that the real GC boxes (Briggs) are likely active or activable.
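
The Occurrences and Averages columns in the table above appear to follow from simple arithmetic: occurrences are the number of matches divided by the number of strands or datasets searched, and the random average is the mean of the arbitrary and alternate occurrences. This reading is inferred from the table values rather than stated in the text, and the function names below are illustrative.

```python
def occurrences(hits: int, strands: int) -> float:
    """Matches per strand (reals) or per random dataset (randoms)."""
    return hits / strands


def random_average(arbitrary_hits: int, alternate_hits: int,
                   strands: int = 10) -> float:
    """Mean of the arbitrary and alternate occurrences for the random datasets."""
    return (occurrences(arbitrary_hits, strands)
            + occurrences(alternate_hits, strands)) / 2


# Distal promoters, negative direction, GC box (Briggs) rows of the table.
real = occurrences(2, 2)
randoms = round(random_average(2, 5), 2)
print(real, randoms)   # 1.0 0.35
print(real > randoms)  # True
```
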

GC box (Ye) samplings

Copying the response element consensus sequence GGGCGG and pasting it into "⌘F" (find) locates four occurrences between ZNF497 and A1BG or none between ZSCAN22 and A1BG, as can be found by the computer programs.

Basic programs testing the consensus sequence GGGCGG (starting with SuccessablesGCYe.bas) were written to compare nucleotide sequences with the sequences on either the template strand (-) or the coding strand (+) of the DNA, in the negative direction (-) or the positive direction (+). For each strand and direction, the list gives the sequence looked for, the number of matches found, and each match with its position:

negative strand, negative direction, looking for GGGCGG, 0.
positive strand, negative direction, looking for GGGCGG, 2, GGGCGG at 2724, GGGCGG at 1809.
negative strand, positive direction, looking for GGGCGG, 6, GGGCGG at 4439, GGGCGG at 4429, GGGCGG at 1901, GGGCGG at 1793, GGGCGG at 406, GGGCGG at 353.
positive strand, positive direction, looking for GGGCGG, 1, GGGCGG at 4238.
inverse complement, negative strand, negative direction, looking for CCGCCC, 3, CCGCCC at 4000, CCGCCC at 3091, CCGCCC at 1251.
inverse complement, positive strand, negative direction, looking for CCGCCC, 0.
inverse complement, negative strand, positive direction, looking for CCGCCC, 1, CCGCCC at 4292.
inverse complement, positive strand, positive direction, looking for CCGCCC, 5, CCGCCC at 4440, CCGCCC at 4430, CCGCCC at 2486, CCGCCC at 1026, CCGCCC at 407.

GCboxYe (4560-2846) UTRs

Negative strand, negative direction: CCGCCC at 4000, CCGCCC at 3091.

GCboxYe positive direction (4445-4265) core promoters

Negative strand, positive direction: GGGCGG at 4439, GGGCGG at 4429.
Negative strand, positive direction: CCGCCC at 4292.
Positive strand, positive direction: CCGCCC at 4440, CCGCCC at 4430.

GCboxYe negative direction (2811-2596) proximal promoters

Positive strand, negative direction: GGGCGG at 2724.

GCboxYe positive direction (4265-4050) proximal promoters

Positive strand, positive direction: GGGCGG at 4238.

GCboxYe negative direction (2596-1) distal promoters

Negative strand, negative direction: CCGCCC at 1251.
Positive strand, negative direction: GGGCGG at 1809.

GCboxYe positive direction (4050-1) distal promoters

Negative strand, positive direction: GGGCGG at 1901, GGGCGG at 1793, GGGCGG at 406, GGGCGG at 353.
Positive strand, positive direction: CCGCCC at 2486, CCGCCC at 1026, CCGCCC at 407.

GC box (Ye) random dataset samplings

GCboxYer0: 2, GGGCGG at 3546, GGGCGG at 2996.
GCboxYer1: 1, GGGCGG at 1152.
GCboxYer2: 1, GGGCGG at 2818.
GCboxYer3: 1, GGGCGG at 518.
GCboxYer4: 4, GGGCGG at 4496, GGGCGG at 4108, GGGCGG at 3549, GGGCGG at 1874.
GCboxYer5: 4, GGGCGG at 3181, GGGCGG at 2931, GGGCGG at 2034, GGGCGG at 177.
GCboxYer6: 1, GGGCGG at 4350.
GCboxYer7: 3, GGGCGG at 1777, GGGCGG at 1220, GGGCGG at 786.
GCboxYer8: 2, GGGCGG at 4121, GGGCGG at 241.
GCboxYer9: 1, GGGCGG at 3627.
GCboxYer0ci: 2, CCGCCC at 3891, CCGCCC at 1385.
GCboxYer1ci: 1, CCGCCC at 3759.
GCboxYer2ci: 3, CCGCCC at 2599, CCGCCC at 1967, CCGCCC at 640.
GCboxYer3ci: 6, CCGCCC at 4139, CCGCCC at 3080, CCGCCC at 2793, CCGCCC at 485, CCGCCC at 440, CCGCCC at 118.
GCboxYer4ci: 1, CCGCCC at 81.
GCboxYer5ci: 7, CCGCCC at 4354, CCGCCC at 4100, CCGCCC at 2816, CCGCCC at 1160, CCGCCC at 988, CCGCCC at 350, CCGCCC at 8.
GCboxYer6ci: 1, CCGCCC at 4338.
GCboxYer7ci: 3, CCGCCC at 4437, CCGCCC at 2743, CCGCCC at 2668.
GCboxYer8ci: 3, CCGCCC at 2474, CCGCCC at 2013, CCGCCC at 747.
GCboxYer9ci: 3, CCGCCC at 4447, CCGCCC at 2667, CCGCCC at 2450.
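
How the random datasets were generated is not described here. As a point of reference only, the sketch below assumes a dataset of 4560 independent, equally likely nucleotides (the length is taken from the 4560-2846 range in the headings) and estimates how often a fixed six-nucleotide motif such as GGGCGG would then turn up by chance. The generator and all function names are hypothetical.

```python
import random


def expected_hits(length: int, motif_len: int) -> float:
    """Expected matches of one fixed motif in a uniform random sequence."""
    windows = length - motif_len + 1       # possible start positions
    return windows / 4 ** motif_len        # each window matches with p = 4^-k


def random_dataset(length: int, seed: int) -> str:
    """Hypothetical generator: independent, equally likely A, C, G and T."""
    rng = random.Random(seed)
    return "".join(rng.choices("ACGT", k=length))


def count_overlapping(seq: str, motif: str) -> int:
    """Count matches at every start position, so overlapping hits are included."""
    return sum(seq.startswith(motif, i)
               for i in range(len(seq) - len(motif) + 1))


print(round(expected_hits(4560, 6), 3))              # 1.112
print(count_overlapping("GGGCGGGCGG", "GGGCGG"))     # 2 (the two matches overlap)
print(count_overlapping(random_dataset(4560, seed=0), "GGGCGG"))
```

If the actual random datasets were built differently, for example by matching the GC content of the real sequence, the expected count would differ from this uniform estimate.
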

GCboxYer arbitrary (evens) (4560-2846) UTRs

GCboxYer0: GGGCGG at 3546, GGGCGG at 2996.
GCboxYer4: GGGCGG at 4496, GGGCGG at 4108, GGGCGG at 3549.
GCboxYer6: GGGCGG at 4350.
GCboxYer8: GGGCGG at 4121.
GCboxYer0ci: CCGCCC at 3891.
GCboxYer6ci: CCGCCC at 4338.

GCboxYer alternate (odds) (4560-2846) UTRs

GCboxYer5: GGGCGG at 3181, GGGCGG at 2931.
GCboxYer9: GGGCGG at 3627.
GCboxYer1ci: CCGCCC at 3759.
GCboxYer3ci: CCGCCC at 4139, CCGCCC at 3080.

GCboxYer arbitrary negative direction (evens) (2846-2811) core promoters

GCboxYer2: GGGCGG at 2818.

GCboxYer arbitrary positive direction (odds) (4445-4265) core promoters

GCboxYer5ci: CCGCCC at 4354.
GCboxYer7ci: CCGCCC at 4437.
GCboxYer9ci: CCGCCC at 4447.

GCboxYer alternate positive direction (evens) (4445-4265) core promoters

GCboxYer6: GGGCGG at 4350.

GCboxYer arbitrary negative direction (evens) (2811-2596) proximal promoters

GCboxYer2ci: CCGCCC at 2599.

GCboxYer alternate negative direction (odds) (2811-2596) proximal promoters

GCboxYer3ci: CCGCCC at 2793.

GCboxYer arbitrary positive direction (odds) (4265-4050) proximal promoters

GCboxYer3ci: CCGCCC at 4139.
GCboxYer5ci: CCGCCC at 4100.

GCboxYer alternate positive direction (evens) (4265-4050) proximal promoters

GCboxYer4: GGGCGG at 4108.
GCboxYer8: GGGCGG at 4121.

GCboxYer arbitrary negative direction (evens) (2596-1) distal promoters

GCboxYer4: GGGCGG at 1874.
GCboxYer8: GGGCGG at 241.
GCboxYer0ci: CCGCCC at 1385.
GCboxYer2ci: CCGCCC at 1967, CCGCCC at 640.
GCboxYer4ci: CCGCCC at 81.
GCboxYer8ci: CCGCCC at 2474, CCGCCC at 2013, CCGCCC at 747.

GCboxYer alternate negative direction (odds) (2596-1) distal promoters

GCboxYer1: GGGCGG at 1152.
GCboxYer3: GGGCGG at 518.
GCboxYer5: GGGCGG at 2034, GGGCGG at 177.
GCboxYer7: GGGCGG at 1777, GGGCGG at 1220, GGGCGG at 786.
GCboxYer3ci: CCGCCC at 485, CCGCCC at 440, CCGCCC at 118.

GCboxYer arbitrary positive direction (odds) (4050-1) distal promoters

GCboxYer1: GGGCGG at 1152.
GCboxYer3: GGGCGG at 518.
GCboxYer5: GGGCGG at 3181, GGGCGG at 2931, GGGCGG at 2034, GGGCGG at 177.
GCboxYer7: GGGCGG at 1777, GGGCGG at 1220, GGGCGG at 786.
GCboxYer9: GGGCGG at 3627.
GCboxYer1ci: CCGCCC at 3759.
GCboxYer3ci: CCGCCC at 3080, CCGCCC at 2793, CCGCCC at 485, CCGCCC at 440, CCGCCC at 118.
GCboxYer5ci: CCGCCC at 2816, CCGCCC at 1160, CCGCCC at 988, CCGCCC at 350, CCGCCC at 8.
GCboxYer7ci: CCGCCC at 2743, CCGCCC at 2668.
GCboxYer9ci: CCGCCC at 2667, CCGCCC at 2450.

GCboxYer alternate positive direction (evens) (4050-1) distal promoters

GCboxYer0: GGGCGG at 3546, GGGCGG at 2996.
GCboxYer2: GGGCGG at 2818.
GCboxYer4: GGGCGG at 3549, GGGCGG at 1874.
GCboxYer8: GGGCGG at 241.
GCboxYer0ci: CCGCCC at 3891, CCGCCC at 1385.
GCboxYer2ci: CCGCCC at 2599, CCGCCC at 1967, CCGCCC at 640.

GC box (Ye) analysis and results

"The special protein (Sp) family consists [...] DNA-binding proteins [...] that mainly regulate the expression of many genes by binding to GC box motifs (GGGCGG), [...]."^[18]

Reals or randoms	Promoters	Direction	Numbers	Strands	Occurrences	Averages (± 0.1)
Reals	UTR	negative	2	2	1	1
Randoms	UTR	arbitrary negative	9	10	0.9	0.75
Randoms	UTR	alternate negative	6	10	0.6	0.75
Reals	Core	negative	0	2	0	0
Randoms	Core	arbitrary negative	1	10	0.1	0.1
Randoms	Core	alternate negative	1	10	0.1	0.1
Reals	Core	positive	2	2	1	1
Randoms	Core	arbitrary positive	3	10	0.3	0.2
Randoms	Core	alternate positive	1	10	0.1	0.2
Reals	Proximal	negative	1	2	0.5	0.5
Randoms	Proximal	arbitrary negative	1	10	0.1	0.1
Randoms	Proximal	alternate negative	1	10	0.1	0.1
Reals	Proximal	positive	1	2	0.5	0.5
Randoms	Proximal	arbitrary positive	2	10	0.2	0.2
Randoms	Proximal	alternate positive	2	10	0.2	0.2
Reals	Distal	negative	2	2	1	1
Randoms	Distal	arbitrary negative	9	10	0.9	0.95
Randoms	Distal	alternate negative	10	10	1	0.95
Reals	Distal	positive	7	2	3.5	3.5
Randoms	Distal	arbitrary positive	25	10	2.5	1.8
Randoms	Distal	alternate positive	11	10	1.1	1.8