Raw data access and API

API for accessing raw data

inStrain stores much more data than is shown in the output folder. It is kept in the raw_data folder, and is mostly stored in compressed formats. This data can be easily accessed using python, as described below.

To access the data, you first make an SNVprofile object of the inStrain output profile, and then you access data from that object. For example, the following code accessed the raw SNP table

import inStrain
import inStain.SNVprofile

IS = inStain.SNVprofile.SNVprofile(``/home/mattolm/inStrainOutputTest/``)
raw_snps = IS.get('raw_snp_table')

You can use the example above (IS.get()) to access any of the raw data described in the following section. There are also another special things that are accessed in other ways, as described in the section “Accessing other data”

Basics of raw_data

A typical run of inStrain will yield a folder titled “raw_data”, with lots of individual files in it. The specifics of what files are in there depend on how inStrain was run, and whether or not additional commands were run as well (like profile_genes).

There will always be a file titled “attributes.tsv”. This describes some basic information about each item in the raw data. Here’s an example

attributes.tsv

name

value

type

description

location

/Users/mattolm/Programs/inStrain/test/test_backend/testdir/test

value

Location of SNVprofile object

version

1.3.3

value

Version of inStrain

Rdic

/Users/mattolm/Programs/inStrain/test/test_backend/testdir/test/raw_data/Rdic.json

dictionary

Read pair -> mismatches

mapping_info

/Users/mattolm/Programs/inStrain/test/test_backend/testdir/test/raw_data/mapping_info.csv.gz

pandas

Report on reads

fasta_loc

/Users/mattolm/Programs/inStrain/test/test_data/N5_271_010G1_scaffold_min1000.fa

value

Location of .fasta file used during profile

scaffold2length

/Users/mattolm/Programs/inStrain/test/test_backend/testdir/test/raw_data/scaffold2length.json

dictionary

Dictionary of scaffold 2 length

object_type

profile

value

Type of SNVprofile (profile or compare)

bam_loc

/Users/mattolm/Programs/inStrain/test/test_data/N5_271_010G1_scaffold_min1000.fa-vs-N5_271_010G1.sorted.bam

value

Location of .bam file

scaffold_list

/Users/mattolm/Programs/inStrain/test/test_backend/testdir/test/raw_data/scaffold_list.txt

list

1d list of scaffolds that were profiled

raw_linkage_table

/Users/mattolm/Programs/inStrain/test/test_backend/testdir/test/raw_data/raw_linkage_table.csv.gz

pandas

Raw table of linkage information

raw_snp_table

/Users/mattolm/Programs/inStrain/test/test_backend/testdir/test/raw_data/raw_snp_table.csv.gz

pandas

Contains raw SNP information on a mm level

cumulative_scaffold_table

/Users/mattolm/Programs/inStrain/test/test_backend/testdir/test/raw_data/cumulative_scaffold_table.csv.gz

pandas

Cumulative coverage on mm level. Formerly scaffoldTable.csv

cumulative_snv_table

/Users/mattolm/Programs/inStrain/test/test_backend/testdir/test/raw_data/cumulative_snv_table.csv.gz

pandas

Cumulative SNP on mm level. Formerly snpLocations.pickle

scaffold_2_mm_2_read_2_snvs

/Users/mattolm/Programs/inStrain/test/test_backend/testdir/test/raw_data/scaffold_2_mm_2_read_2_snvs.pickle

pickle

crazy nonsense needed for linkage

genes_coverage

/Users/mattolm/Programs/inStrain/test/test_backend/testdir/test/raw_data/genes_coverage.csv.gz

pandas

Coverage of individual genes

genes_clonality

/Users/mattolm/Programs/inStrain/test/test_backend/testdir/test/raw_data/genes_clonality.csv.gz

pandas

Clonality of individual genes

genes_SNP_count

/Users/mattolm/Programs/inStrain/test/test_backend/testdir/test/raw_data/genes_SNP_count.csv.gz

pandas

SNP density and counts of individual genes

SNP_mutation_types

/Users/mattolm/Programs/inStrain/test/test_backend/testdir/test/raw_data/SNP_mutation_types.csv.gz

pandas

The mutation types of SNPs

covT

/Users/mattolm/Programs/inStrain/test/test_backend/testdir/test/raw_data/covT.hd5

special

Scaffold -> mm -> position based coverage

clonT

/Users/mattolm/Programs/inStrain/test/test_backend/testdir/test/raw_data/clonT.hd5

special

Scaffold -> mm -> position based clonality

clonTR

/Users/mattolm/Programs/inStrain/test/test_backend/testdir/test/raw_data/clonTR.hd5

special

Scaffold -> mm -> rarefied position based clonality

genes_fileloc

/Users/mattolm/Programs/inStrain/test/test_data/N5_271_010G1_scaffold_min1000.fa.genes.fna

value

Location of genes file that was used to call genes

genes_table

/Users/mattolm/Programs/inStrain/test/test_backend/testdir/test/raw_data/genes_table.csv.gz

pandas

Location of genes in the associated genes_file

scaffold2bin

/Users/mattolm/Programs/inStrain/test/test_backend/testdir/test/raw_data/scaffold2bin.json

dictionary

Dictionary of scaffold 2 bin

bin2length

/Users/mattolm/Programs/inStrain/test/test_backend/testdir/test/raw_data/bin2length.json

dictionary

Dictionary of bin 2 total length

genome_level_info

/Users/mattolm/Programs/inStrain/test/test_backend/testdir/test/raw_data/genome_level_info.csv.gz

pandas

Table of genome-level information

This is what the columns correspond to:

name

The name of the data. This is the name that you put into IS.get() to have inStrain retrieve the data for you. See the section “Accessing raw data” for an example.

value

This lists the path to where the data is located within the raw_data folder. If the type of data is a value, than this just lists the value

type

This describes how the data is stored. Value = the data is whatever is listed under value; list = a python list; numpy = a numpy array; dictionary = a python dictionary; pandas = a pandas dataframe; pickle = a piece of data that’s stored as a python pickle object; special = a piece of data that is stored in a special way that inStrain knows how to de-compress

description

A one-sentence description of what’s in the data.

Warning

Many of these pieces of raw data have the column “mm” in them, which means that things are calculated at every possible read mismatch level. This is often not what you want. See the section “Dealing with mm” for more information.

Accessing other data

In addition to the raw_data described above, there are a couple of other things that inStrain can make for you. You access these from methods that run on the IS object itself, instead of using the get method. For example:

import inStrain
import inStain.SNVprofile

IS = inStain.SNVprofile.SNVprofile(``/home/mattolm/inStrainOutputTest/``)
coverage_table = IS.get_raw_coverage_table()

The following methods work like that:

get_nonredundant_scaffold_table()

Get a scaffold table with just one line per scaffold, not multiple mms

get_nonredundant_linkage_table()

Get a linkage table with just one line per scaffold, not multiple mms

get_nonredundant_snv_table()

Get a SNP table with just one line per scaffold, not multiple mms

get_clonality_table()

Get a raw clonality table, listing the clonality of each position. Pass nonredundant=False to keep multiple mms

Dealing with “mm”

Behind the scenes, inStrain actually calculates pretty much all metrics for every read pair mismatch level. That is, only including read pairs with 0 mis-match to the reference sequences, only including read pairs with >= 1 mis-match to the reference sequences, all the way up to the number of mismatches associated with the “PID” parameter.

For most of the output that inStrain makes in the output folder, it removes the “mm” column and just gives the results for the maximum number of mismatches. However, it’s often helpful to explore other mismatches levels, to see how parameters vary with more or less stringent mappings. Much of the data stored in “read_data” is on the mismatch level. Here’s an example of what the looks like (this is the cumulative_scaffold_table):

,scaffold,length,breadth,coverage,coverage_median,coverage_std,bases_w_0_coverage,mean_clonality,median_clonality,unmaskedBreadth,SNPs,breadth_expected,ANI,mm
0,N5_271_010G1_scaffold_102,1144,0.9353146853146853,5.106643356643357,5,2.932067325774674,74,1.0,1.0,0.6145104895104895,0,0.9889923642060382,1.0,0
1,N5_271_010G1_scaffold_102,1144,0.9353146853146853,6.421328671328672,6,4.005996333777764,74,0.9992001028104149,1.0,0.6748251748251748,0,0.9965522492489882,1.0,1
2,N5_271_010G1_scaffold_102,1144,0.9423076923076923,7.3627622377622375,7,4.2747074564903285,66,0.9993874800638958,1.0,0.7928321678321678,0,0.998498542620078,1.0,2
3,N5_271_010G1_scaffold_102,1144,0.9423076923076923,7.859265734265734,8,4.748789115369562,66,0.9992251555869703,1.0,0.7928321678321678,0,0.9990314705263914,1.0,3
4,N5_271_010G1_scaffold_102,1144,0.9423076923076923,8.017482517482517,8,4.952541407151938,66,0.9992251555869703,1.0,0.7928321678321678,0,0.9991577528529144,1.0,4
5,N5_271_010G1_scaffold_102,1144,0.9458041958041958,8.271853146853147,8,4.9911156795536105,62,0.9992512780077317,1.0,0.8024475524475524,0,0.9993271891539499,1.0,7

As you can see, the same scaffold is shown multiple times, and the last column is mm. At the row with mm = 0, you can see what the stats are when only considering reads that perfectly map to the reference sequence. As the mm goes higher, so do stats like coverage and breadth, as you now allow reads with more mismatches to count in the generation of these stats. In order to convert this files to what is provided in the output folder, the following code is run:

import inStrain
import inStain.SNVprofile

IS = inStain.SNVprofile.SNVprofile(``/home/mattolm/inStrainOutputTest/``)
scdb = IS.get('cumulative_scaffold_table')
ScaffDb = scdb.sort_values('mm')\
            .drop_duplicates(subset=['scaffold'], keep='last')\
            .sort_index().drop(columns=['mm'])

The last line looks complicated, but it’s very simple what is going on. First, you sort the database by mm, with the lowest mms at the top. Next, for each scaffold, you only keep the row with the lowest mm. That’s done using the drop_duplicates(subset=['scaffold'], keep='last') command. Finally, you re-sort the DataFrame to the original order, and remove the mm column. In the above example, this would mean that the only row that would survive would be where mm = 7, because that’s the bottom row for that scaffold.

You can of course subset to any level of mismatch by modifying the above code slightly. For example, to generate this table only using reads with <=5 mismatches, you could use the following code:

import inStrain
import inStain.SNVprofile

IS = inStain.SNVprofile.SNVprofile(``/home/mattolm/inStrainOutputTest/``)
scdb = IS.get('cumulative_scaffold_table')
scdb = scdb[scdb['mm'] <= 5]
ScaffDb = scdb.sort_values('mm')\
            .drop_duplicates(subset=['scaffold'], keep='last')\
            .sort_index().drop(columns=['mm'])

Warning

You usually do not want to subset these DataFrames using something like scdb = scdb[scdb['mm'] == 5]. That’s because if there are no reads that have 5 mismatches, as in the case above, you’ll end up with an empty DataFrame. By using the drop_duplicates technique described above you avoid this problem, because in the cases where you don’t have 5 mismatches, you just get the next-highest mm level (which is usually what you want)

A note for programmers

If you’d like to edit inStrain to add functionality for your data, don’t hesitate to reach out to the authors of this program for help. Additionally, please consider submitting a pull request on GitHub so that others can use your changes as well.