Data for : Poly(A) Dataset for PAS sequences and pseudo-PAS sequences Classification (fasta format)

  • Fahad Albalawi (Creator)
  • Abderrazak Chahid (Creator)
  • Xingang Guo (Creator)
  • Somayah Albaradei (Creator)
  • Arturo Magana-Mora (Creator)
  • Boris R. Jankovic (Creator)
  • Mahmut Uludag (Creator)
  • Christophe Van Neste (Creator)
  • Magbubah Essack (Creator)
  • Taous-Meriem Laleg-Kirati (Creator)
  • Vladimir B. Bajic (Creator)

Dataset

Description

This dataset contains DNA sequences taken from the human genome assembly hg38 (GRCh38), downloaded from the GENCODE folder on the EBI FTP server
(ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_28/GRCh38.primary_assembly.genome.fa.gz).

A- Positive set (PAS sequences)

We selected the poly(A) signal annotations from the GENCODE poly(A) annotation file
(ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_28/gencode.v28.polyAs.gff3.gz). Using the bedtools slop option, we extended each annotated poly(A) hexamer by 300 bp upstream and 300 bp downstream. With the bedtools getfasta option, we extracted 606 bp FASTA sequences (300 bp + the 6 bp hexamer + 300 bp) from these regions. After eliminating duplicates, we obtained 37’516 presumed true functional poly(A) signal (PAS) sequences. Sequences from this set are denoted as positive.
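
The window arithmetic behind the 606 bp sequences can be illustrated with a minimal Python sketch. It is not the authors' pipeline, which used bedtools; it only mimics the effect of extending a hexamer interval by 300 bp on each side, slicing the sequence, and dropping duplicates. The function names `slop` and `extract_unique`, the in-memory `genome` dictionary, and the choice to deduplicate on sequence content are assumptions made for illustration, and strand is ignored.

```python
def slop(start, end, chrom_len, flank=300):
    """Extend a 0-based, half-open interval by `flank` bp on each side,
    clamped to the chromosome boundaries."""
    return max(0, start - flank), min(chrom_len, end + flank)


def extract_unique(genome, hexamer_sites, flank=300):
    """Return the distinct flanked sequences for the given hexamer sites.

    genome: dict mapping chromosome name -> sequence string (hypothetical input)
    hexamer_sites: iterable of (chrom, start, end), 0-based half-open
    """
    seen = set()
    sequences = []
    for chrom, start, end in hexamer_sites:
        chrom_seq = genome[chrom]
        win_start, win_end = slop(start, end, len(chrom_seq), flank)
        window = chrom_seq[win_start:win_end].upper()
        if window not in seen:  # assumption: duplicates are identical sequences
            seen.add(window)
            sequences.append(window)
    return sequences
```

A 6 bp hexamer with 300 bp available on both sides gives a 606 bp window; sites closer than 300 bp to a chromosome end would give shorter windows in this sketch.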

B- Negative set (pseudo-PAS sequences)
For the negative set, we used bedtools complement to find the regions lying outside the 1’000 bp upstream and downstream of each positive poly(A) hexamer signal. The HOMER tool was used to find matches for the 12 most frequent human poly(A) variants in these regions. Because the number of matches was very large, we sampled 37’516 pseudo-PAS sequences from them. Sampling from each chromosome was proportional to the chromosome length and to the expected frequency of the poly(A) variants. From these matches, for each PAS hexamer, we selected the same number of pseudo-PAS sequences as in the positive set.
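
One way to picture the proportional sampling is as a quota allocation: a fixed number of draws is split across chromosomes according to their lengths. The sketch below is an assumption about how such quotas could be computed, not the authors' code. The function name `allocate_proportionally` and the largest-remainder rounding rule are illustrative choices that the description does not specify.

```python
def allocate_proportionally(total, weights):
    """Split `total` draws across keys in proportion to `weights`,
    using largest-remainder rounding so the quotas sum to `total`."""
    weight_sum = sum(weights.values())
    exact = {key: total * w / weight_sum for key, w in weights.items()}
    quota = {key: int(share) for key, share in exact.items()}
    leftover = total - sum(quota.values())
    # Give the remaining draws to the keys with the largest fractional parts.
    by_remainder = sorted(exact, key=lambda key: exact[key] - quota[key], reverse=True)
    for key in by_remainder[:leftover]:
        quota[key] += 1
    return quota
```

The same function could be applied a second time within each chromosome, with the expected hexamer-variant frequencies as weights, to approximate the two-level proportionality described above.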

Training and testing sets
We randomly selected 20% of the sequences from each of the positive and negative datasets as independent test data. The testing set thus consisted of 15’020 sequences. The remaining data formed the training set, which consisted of 60’012 sequences. Both sets are balanced with respect to true PAS and pseudo-PAS sequences.
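
A per-class hold-out split of this kind can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' procedure: the function name `split_holdout`, the fixed seed, and the rounding of the test size are all hypothetical choices.

```python
import random


def split_holdout(items, test_fraction=0.2, seed=0):
    """Shuffle `items` and return (train, test), holding out `test_fraction`."""
    rng = random.Random(seed)  # fixed seed so the split is repeatable
    shuffled = list(items)
    rng.shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]
```

Applying the function separately to the positive and the negative set keeps both resulting sets balanced. Note that the reported sizes add up to the full data (15’020 + 60’012 = 75’032 = 2 × 37’516), while a flat 20% of 37’516 per class would give 7’503 sequences per class, so the published split evidently used a slightly different rounding or grouping that this sketch does not reproduce.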
Date made available: Nov 15 2018
Publisher: KAUST Research Repository
