Comparison between one element and all other columns of the DataFrame
I have a list of tuples that I have turned into a DataFrame with thousands of rows, for example:
frag mass prot_position
0 TFDEHNAPNSNSNK 1573.675712 2
1 EPGANAIGMVAFK 1303.659458 29
2 GTIK 417.258734 2
3 SPWPSMAR 930.438172 44
4 LPAK 427.279469 29
5 NEDSFVVWEQIINSLSALK 2191.116099 17
...
and I have the following rule:
def are_dif(m1, m2, ppm=10):
    if abs((m1 - m2) / m1) < ppm * 0.000001:
        v = False
    else:
        v = True
    return v
So, I want to keep only the fragments whose mass is different (by this rule) from the mass of every other fragment. How can I make this selection?
Then I have a list named "pinfo" that contains:
d = {'id':id, 'seq':seq_code, "1HW_fit":hits_fit}
# one for each protein
# each dictionary as the position of the protein that it describes.
So, for each selected fragment, I want to add 1 to the value "hits_fit" in the dictionary corresponding to its protein.
If I understand correctly (and I'm not sure that I do), you can accomplish quite a bit just by sorting. First, let me tweak the data so that it contains both near and far values for mass:
Unnamed: 0 frag mass prot_position
0 0 TFDEHNAPNSNSNK 1573.675712 2
1 1 EPGANAIGMVAFK 1573.675700 29
2 2 GTIK 417.258734 2
3 3 SPWPSMAR 417.258700 44
4 4 LPAK 427.279469 29
5 5 NEDSFVVWEQIINSLSALK 2191.116099 17
Then I think you can do something like the following to select the "good" ones. First create a "pdiff" column (the difference to the previous mass, as a fraction of the mass) to see how close the nearest neighbor's mass is:
ppm = .00001  # 10 ppm, expressed as a fraction
df = df.sort_values('mass')  # df.sort('mass') on old pandas versions
df['pdiff'] = (df.mass-df.mass.shift()) / df.mass
Unnamed: 0 frag mass prot_position pdiff
3 3 SPWPSMAR 417.258700 44 NaN
2 2 GTIK 417.258734 2 8.148421e-08
4 4 LPAK 427.279469 29 2.345241e-02
1 1 EPGANAIGMVAFK 1573.675700 29 7.284831e-01
0 0 TFDEHNAPNSNSNK 1573.675712 2 7.625459e-09
5 5 NEDSFVVWEQIINSLSALK 2191.116099 17 2.817926e-01
The first and last rows of data make this a little trickier, so the next line backfills the missing pdiff in the first row and repeats the last row, so that the mask after it works correctly. This works for the example here, but it might need to be changed for other cases (but only for the first and last rows of data).
df = df.iloc[list(range(len(df)))+[-1]].bfill()  # list(...) is required on Python 3
df[ (df['pdiff'] > ppm) & (df['pdiff'].shift(-1) > ppm) ]
Results:
Unnamed: 0 frag mass prot_position pdiff
4 4 LPAK 427.279469 29 0.023452
5 5 NEDSFVVWEQIINSLSALK 2191.116099 17 0.281793
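As a rough sketch of the same idea without the padding step, you can compare each sorted mass with both of its neighbors and treat a missing neighbor as infinitely far away. This assumes a recent pandas, the tuned data above, and the 10 ppm threshold from the question; the names tol, gap_prev, gap_next and is_unique are mine, not from the question.

```python
import numpy as np
import pandas as pd

# Tuned example data from above (two pairs of near-identical masses).
df = pd.DataFrame({
    "frag": ["TFDEHNAPNSNSNK", "EPGANAIGMVAFK", "GTIK",
             "SPWPSMAR", "LPAK", "NEDSFVVWEQIINSLSALK"],
    "mass": [1573.675712, 1573.675700, 417.258734,
             417.258700, 427.279469, 2191.116099],
    "prot_position": [2, 29, 2, 44, 29, 17],
})

tol = 10 * 1e-6  # 10 ppm as a fraction (assumed threshold, as in are_dif)

s = df.sort_values("mass")
m = s["mass"]

# Relative gap to the previous and next mass in sorted order.
# The first/last rows have no neighbor on one side; NaN becomes +inf there.
gap_prev = ((m - m.shift(1)).abs() / m).fillna(np.inf)
gap_next = ((m.shift(-1) - m).abs() / m).fillna(np.inf)

# Keep a row only if it is "different" from both nearest neighbors.
is_unique = (gap_prev >= tol) & (gap_next >= tol)
result = s[is_unique]
print(result)
```

Both gaps are divided by the row's own mass, which mirrors are_dif(m1, m2) with m1 as the fragment being tested. After sorting, checking the two nearest neighbors is enough, because the relative gap only grows for masses further away.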
Sorry, I don't understand the second part of the question.
Edit to add: As mentioned in the comment on @AmiTavory's answer, I think the sorting and grouping approaches could perhaps be combined into a simpler answer than this. I may try that later, but anyone interested should feel free to take a shot at it themselves.
Here's something slightly different from what you asked, but it's very simple and I think it gives a similar effect.
Using numpy.round, you can create a new column:
import numpy as np
df['roundedMass'] = np.round(df.mass, 6)
Then you can group by the rounded mass and use nunique to count the distinct fragments in each group. Filter for groups of size 1.
So, the number of fragments per group:
df.frag.groupby(np.round(df.mass, 6)).nunique()
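Here is a minimal sketch of the "filter for groups of size 1" step. Note that rounding gives an absolute bin width rather than a ppm tolerance, and two close masses that fall on either side of a bin edge will not be grouped together. The number of decimals (3 here), the example data (the tuned data from the other answer) and the names key and group_size are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "frag": ["TFDEHNAPNSNSNK", "EPGANAIGMVAFK", "GTIK",
             "SPWPSMAR", "LPAK", "NEDSFVVWEQIINSLSALK"],
    "mass": [1573.675712, 1573.675700, 417.258734,
             417.258700, 427.279469, 2191.116099],
    "prot_position": [2, 29, 2, 44, 29, 17],
})

decimals = 3  # assumed bin width; pick it to suit your mass accuracy

# Group key: the mass rounded to a fixed number of decimals.
key = np.round(df["mass"], decimals)

# For every row, the number of distinct fragments sharing its rounded mass.
group_size = df.groupby(key)["frag"].transform("nunique")

# Keep only the rows that are alone in their group.
unique_by_rounding = df[group_size == 1]
print(unique_by_rounding)
```

Using transform keeps the result aligned with the original rows, so it can be used directly as a boolean mask.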
Another solution might be to create a copy of your list (if you need to keep the original for further processing later), iterate over it, and remove any elements that fail your rule when compared with another element (m1 and m2).
You will end up with a new list containing only the unique masses.
Just don't forget that if you need to use the original list later, you will need to make the copy with deepcopy.
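A minimal sketch of this list-based idea, assuming the tuples are (frag, mass, prot_position) and reusing the rule from the question. It builds a new list instead of deleting in place, so the original is left untouched and no copy is needed in this variant. The example data and the names fragments and unique are made up for illustration, and the pairwise check is quadratic, so it may be slow for many thousands of rows.

```python
def are_dif(m1, m2, ppm=10):
    """Rule from the question: True if m2 differs from m1 by at least ppm."""
    return abs((m1 - m2) / m1) >= ppm * 0.000001


# Hypothetical input: (frag, mass, prot_position) tuples.
fragments = [
    ("TFDEHNAPNSNSNK", 1573.675712, 2),
    ("EPGANAIGMVAFK", 1573.675700, 29),
    ("GTIK", 417.258734, 2),
    ("SPWPSMAR", 417.258700, 44),
    ("LPAK", 427.279469, 29),
    ("NEDSFVVWEQIINSLSALK", 2191.116099, 17),
]

# Keep a fragment only if its mass differs from every *other* fragment's mass.
unique = [
    (frag, mass, pos)
    for i, (frag, mass, pos) in enumerate(fragments)
    if all(
        are_dif(mass, other_mass)
        for j, (_, other_mass, _) in enumerate(fragments)
        if j != i
    )
]
print(unique)
```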