Biology

Biology and bioinformatics-oriented data cleaning functions.

`join_fasta(df, filename, id_col, column_name)`

Convenience method to join in a FASTA file as a column.

This adds the sequence string of each FASTA record as a new column in the DataFrame, matching records to rows by the IDs in `id_col`.

This method attaches only the string representation of Biopython's `SeqRecord.Seq` object, not the full `SeqRecord`. The alphabet is also not stored, on the assumption that the data scientist knows what kind of sequence is being read in (nucleotide vs. amino acid).
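
To make the "string only" behaviour concrete, the sketch below builds the same kind of ID-to-sequence mapping with a small hand-written parser. The `read_fasta_strings` helper is hypothetical and only stands in for Biopython's `SeqIO.parse`. It is a simplification that assumes the first whitespace-delimited token of each header line is the record ID and discards everything else in the record.

```python
from typing import Dict, Iterable, Optional


def read_fasta_strings(lines: Iterable[str]) -> Dict[str, str]:
    """Map each record ID to its sequence as a plain string.

    Hypothetical helper for illustration; not part of pyjanitor or Biopython.
    """
    records: Dict[str, str] = {}
    current_id: Optional[str] = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # Assumed ID rule: first token after ">"; the rest is dropped.
            tokens = line[1:].split()
            current_id = tokens[0] if tokens else ""
            records[current_id] = ""
        elif current_id is not None:
            # Sequence lines may be wrapped, so concatenate them.
            records[current_id] += line
    return records


fasta_text = """>SEQUENCE_1
MTEITAAMVKELRESTGAGMMDCK
>SEQUENCE_2
SATVSEINSETDFVAKN"""

sequences = read_fasta_strings(fasta_text.splitlines())
print(sequences["SEQUENCE_2"])
```

Only plain `str` values survive in the mapping, so any later step has to decide for itself whether a sequence is nucleotide or amino acid.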

This method mutates the original DataFrame.
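
Because the new column is assigned in place, the frame passed in and the frame returned are the same object. The sketch below illustrates this with a hypothetical `add_column` function that mirrors the assignment pattern; it is not the library function and reads no FASTA file. It also shows that passing a copy leaves the original untouched.

```python
import pandas as pd


def add_column(df: pd.DataFrame, column_name: str, values: list) -> pd.DataFrame:
    """Hypothetical stand-in: assign a column in place and return the frame."""
    df[column_name] = values
    return df


original = pd.DataFrame({"sequence_accession": ["SEQUENCE_1", "SEQUENCE_2"]})

# Passing a copy keeps `original` unchanged.
safe = add_column(original.copy(), "sequence", ["MTEIT", "SATVS"])
unchanged = "sequence" not in original.columns

# Passing the frame itself modifies it; the return value is the same object.
result = add_column(original, "sequence", ["MTEIT", "SATVS"])
same_object = result is original
mutated = "sequence" in original.columns
```

If the input frame is needed later in its original form, calling the method on `df.copy()` is one way to keep it.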

For more advanced functions, please use phylopandas.

Examples:

>>> import tempfile
>>> import pandas as pd
>>> import janitor.biology
>>> tf = tempfile.NamedTemporaryFile()
>>> tf.write('''>SEQUENCE_1
... MTEITAAMVKELRESTGAGMMDCK
... >SEQUENCE_2
... SATVSEINSETDFVAKN'''.encode('utf8'))
66
>>> tf.seek(0)
0
>>> df = pd.DataFrame({"sequence_accession":
... ["SEQUENCE_1", "SEQUENCE_2", ]})
>>> df = df.join_fasta(
...     filename=tf.name,
...     id_col='sequence_accession',
...     column_name='sequence',
... )
>>> df.sequence
0    MTEITAAMVKELRESTGAGMMDCK
1           SATVSEINSETDFVAKN
Name: sequence, dtype: object

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `df` | `DataFrame` | A pandas DataFrame. | required |
| `filename` | `str` | Path to the FASTA file. | required |
| `id_col` | `str` | The column in the DataFrame that houses sequence IDs. | required |
| `column_name` | `str` | The name of the new column. | required |

Returns:

| Type | Description |
| --- | --- |
| `DataFrame` | A pandas DataFrame with new FASTA string sequence column. |

Source code in janitor/biology.py

@pf.register_dataframe_method
@deprecated_alias(col_name="column_name")
def join_fasta(
    df: pd.DataFrame, filename: str, id_col: str, column_name: str
) -> pd.DataFrame:
    """Convenience method to join in a FASTA file as a column.

    This adds the sequence string of each FASTA record as a new column in the
    DataFrame, matching records to rows by the IDs in `id_col`.

    This method attaches only the string representation of Biopython's
    SeqRecord.Seq object, not the full SeqRecord. The alphabet is also not
    stored, on the assumption that the data scientist knows what kind of
    sequence is being read in (nucleotide vs. amino acid).

    This method mutates the original DataFrame.

    For more advanced functions, please use phylopandas.

    Examples:
        >>> import tempfile
        >>> import pandas as pd
        >>> import janitor.biology
        >>> tf = tempfile.NamedTemporaryFile()
        >>> tf.write('''>SEQUENCE_1
        ... MTEITAAMVKELRESTGAGMMDCK
        ... >SEQUENCE_2
        ... SATVSEINSETDFVAKN'''.encode('utf8'))
        66
        >>> tf.seek(0)
        0
        >>> df = pd.DataFrame({"sequence_accession":
        ... ["SEQUENCE_1", "SEQUENCE_2", ]})
        >>> df = df.join_fasta(  # doctest: +SKIP
        ...     filename=tf.name,
        ...     id_col='sequence_accession',
        ...     column_name='sequence',
        ... )
        >>> df.sequence  # doctest: +SKIP
        0    MTEITAAMVKELRESTGAGMMDCK
        1           SATVSEINSETDFVAKN
        Name: sequence, dtype: object

    Args:
        df: A pandas DataFrame.
        filename: Path to the FASTA file.
        id_col: The column in the DataFrame that houses sequence IDs.
        column_name: The name of the new column.

    Returns:
        A pandas DataFrame with new FASTA string sequence column.
    """
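    # Map each FASTA record ID to its sequence as a plain string.
    seqrecords = {x.id: str(x.seq) for x in SeqIO.parse(filename, "fasta")}
    # Look up the sequence for every ID in the DataFrame, in row order.
    seq_col = [seqrecords[i] for i in df[id_col]]
    # Assign in place; this mutates the DataFrame that was passed in.
    df[column_name] = seq_col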
    return df
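
One consequence of the dictionary lookup in the listing above is that an ID present in the DataFrame but absent from the FASTA file raises a `KeyError`. The sketch below reproduces that with a hand-built `seqrecords` dictionary (an assumption standing in for the parsed file) and contrasts it with `Series.map`, which leaves unmatched IDs as `NaN`. The `map` variant is an alternative shown for comparison, not what `join_fasta` does.

```python
import pandas as pd

# Stand-in for the parsed FASTA file: SEQUENCE_2 is deliberately missing.
seqrecords = {"SEQUENCE_1": "MTEITAAMVKELRESTGAGMMDCK"}
df = pd.DataFrame({"sequence_accession": ["SEQUENCE_1", "SEQUENCE_2"]})

# Direct indexing, as in the listing, fails on the unknown ID.
try:
    [seqrecords[i] for i in df["sequence_accession"]]
    lookup_failed = False
except KeyError:
    lookup_failed = True

# Alternative for comparison: Series.map leaves unmatched IDs as NaN.
df["sequence"] = df["sequence_accession"].map(seqrecords)
missing = df["sequence"].isna().tolist()
```

Whether a hard failure or a missing value is preferable depends on the dataset; failing early makes an incomplete FASTA file obvious, while `NaN` lets the rest of the pipeline continue.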