Biology
Biology and bioinformatics-oriented data cleaning functions.
join_fasta(df, filename, id_col, column_name)
Convenience method to join in a FASTA file as a column.
This allows us to add the string sequence of a FASTA file as a new column of data in the dataframe.
This method only attaches the string representation of the SeqRecord.Seq object from Biopython. Does not attach the full SeqRecord. Alphabet is also not stored, under the assumption that the data scientist has domain knowledge of what kind of sequence is being read in (nucleotide vs. amino acid.)
This method mutates the original DataFrame.
For more advanced functions, please use phylopandas.
Method chaining usage example:
import tempfile import pandas as pd import janitor.biology
tf = tempfile.NamedTemporaryFile() tf.write('''>SEQUENCE_1 ... MTEITAAMVKELRESTGAGMMDCK ... >SEQUENCE_2 ... SATVSEINSETDFVAKN'''.encode('utf8')) 66 tf.seek(0) 0
df = pd.DataFrame({"sequence_accession": ... ["SEQUENCE_1", "SEQUENCE_2", ]})
df = df.join_fasta( ... filename=tf.name, ... id_col='sequence_accession', ... column_name='sequence', ... )
df.sequence 0 MTEITAAMVKELRESTGAGMMDCK 1 SATVSEINSETDFVAKN Name: sequence, dtype: object
Parameters:
Name | Type | Description | Default |
---|---|---|---|
df |
DataFrame |
A pandas DataFrame. |
required |
filename |
str |
Path to the FASTA file. |
required |
id_col |
str |
The column in the DataFrame that houses sequence IDs. |
required |
column_name |
str |
The name of the new column. |
required |
Returns:
Type | Description |
---|---|
DataFrame |
A pandas DataFrame with new FASTA string sequence column. |
Source code in janitor/biology.py
@pf.register_dataframe_method
@deprecated_alias(col_name="column_name")
def join_fasta(
df: pd.DataFrame, filename: str, id_col: str, column_name: str
) -> pd.DataFrame:
"""
Convenience method to join in a FASTA file as a column.
This allows us to add the string sequence of a FASTA file as a new column
of data in the dataframe.
This method only attaches the string representation of the SeqRecord.Seq
object from Biopython. Does not attach the full SeqRecord. Alphabet is
also not stored, under the assumption that the data scientist has domain
knowledge of what kind of sequence is being read in (nucleotide vs. amino
acid.)
This method mutates the original DataFrame.
For more advanced functions, please use phylopandas.
Method chaining usage example:
>>> import tempfile
>>> import pandas as pd
>>> import janitor.biology
>>> tf = tempfile.NamedTemporaryFile()
>>> tf.write('''>SEQUENCE_1
... MTEITAAMVKELRESTGAGMMDCK
... >SEQUENCE_2
... SATVSEINSETDFVAKN'''.encode('utf8'))
66
>>> tf.seek(0)
0
>>> df = pd.DataFrame({"sequence_accession":
... ["SEQUENCE_1", "SEQUENCE_2", ]})
>>> df = df.join_fasta(
... filename=tf.name,
... id_col='sequence_accession',
... column_name='sequence',
... )
>>> df.sequence
0 MTEITAAMVKELRESTGAGMMDCK
1 SATVSEINSETDFVAKN
Name: sequence, dtype: object
:param df: A pandas DataFrame.
:param filename: Path to the FASTA file.
:param id_col: The column in the DataFrame that houses sequence IDs.
:param column_name: The name of the new column.
:returns: A pandas DataFrame with new FASTA string sequence column.
"""
seqrecords = {
x.id: x.seq.__str__() for x in SeqIO.parse(filename, "fasta")
}
seq_col = [seqrecords[i] for i in df[id_col]]
df[column_name] = seq_col
return df