Comparing one element's mass against all other rows of the DataFrame

I have a list of tuples that I have turned into a DataFrame with thousands of rows, for example:

                                          frag         mass  prot_position
0                               TFDEHNAPNSNSNK  1573.675712              2
1                                EPGANAIGMVAFK  1303.659458             29
2                                         GTIK   417.258734              2
3                                     SPWPSMAR   930.438172             44
4                                         LPAK   427.279469             29
5                          NEDSFVVWEQIINSLSALK  2191.116099             17
...


and I have the following rule:

def are_dif(m1, m2, ppm=10):
    # True if m1 and m2 differ by more than ppm parts per million
    return abs((m1 - m2) / m1) >= ppm * 0.000001
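For example, using two masses from the sample data above (the rule is re-stated so the snippet is self-contained):

```python
def are_dif(m1, m2, ppm=10):
    # True if m1 and m2 differ by more than ppm parts per million
    return abs((m1 - m2) / m1) >= ppm * 0.000001

print(are_dif(1573.675712, 1573.675700))  # False: within 10 ppm, counts as "the same"
print(are_dif(417.258734, 427.279469))    # True: clearly different masses
```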


So, I just want to keep a "fragment" only if its mass is different from the masses of all the other fragments. How can I make this selection?

Then I have a list named "pinfo" that contains:

d = {'id': id, 'seq': seq_code, "1HW_fit": hits_fit}
# one dictionary for each protein
# each dictionary holds the position of the protein it describes


So, I want to add 1 to the value "hits_fit" in the dictionary corresponding to that protein.
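A minimal sketch of that second part, assuming each dictionary's 'id' is the protein position that prot_position refers to (that mapping is a guess from the question, as are the sample values):

```python
# hypothetical "pinfo" list, one dictionary per protein, field names from the question
pinfo = [{'id': 2, 'seq': 'SEQ_A', '1HW_fit': 0},
         {'id': 29, 'seq': 'SEQ_B', '1HW_fit': 0}]

def add_hit(pinfo, prot_position):
    # find the dictionary describing this protein and add 1 to its hit count
    for d in pinfo:
        if d['id'] == prot_position:
            d['1HW_fit'] += 1
            break

add_hit(pinfo, 29)  # a fragment at prot_position 29 passed the mass check
```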

+3




3 answers


If I understand correctly (and I'm not sure I do), you can accomplish quite a bit just by sorting. First, let me tweak the data so it has both near and far mass values:

   Unnamed: 0                 frag         mass  prot_position
0           0       TFDEHNAPNSNSNK  1573.675712              2
1           1        EPGANAIGMVAFK  1573.675700             29
2           2                 GTIK   417.258734              2
3           3             SPWPSMAR   417.258700             44
4           4                 LPAK   427.279469             29
5           5  NEDSFVVWEQIINSLSALK  2191.116099             17


Then I think you can do something like the following to select the "good" ones. First create a "pdiff" (percentage difference) to see how close the nearest neighbor mass is:

ppm = .00001
df = df.sort_values('mass')

df['pdiff'] = (df.mass - df.mass.shift()) / df.mass

   Unnamed: 0                 frag         mass  prot_position         pdiff
3           3             SPWPSMAR   417.258700             44           NaN
2           2                 GTIK   417.258734              2  8.148421e-08
4           4                 LPAK   427.279469             29  2.345241e-02
1           1        EPGANAIGMVAFK  1573.675700             29  7.284831e-01
0           0       TFDEHNAPNSNSNK  1573.675712              2  7.625459e-09
5           5  NEDSFVVWEQIINSLSALK  2191.116099             17  2.817926e-01


The first and last rows of the data make this a little trickier, so the next line back-fills the NaN in the first row's pdiff and repeats the last row, so that the mask that follows works correctly. This works for the example here, but it might need adjusting for other cases (though only for the first and last rows of the data).



df = df.iloc[list(range(len(df))) + [-1]].bfill()
df[(df['pdiff'] > ppm) & (df['pdiff'].shift(-1) > ppm)]


Results:

   Unnamed: 0                 frag         mass  prot_position     pdiff
4           4                 LPAK   427.279469             29  0.023452
5           5  NEDSFVVWEQIINSLSALK  2191.116099             17  0.281793


Sorry, I don't understand the second part of the question.

Edit to add: As mentioned in the comments on @AmiTavory's answer, I think the sorting and grouping approaches could perhaps be combined into a simpler answer than this one. I may try that later, but anyone interested should feel free to take a shot at it themselves.
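Putting the whole sorting approach together, here is a self-contained sketch on the toy data above (reset_index avoids the duplicated index label that the repeated last row would otherwise create):

```python
import pandas as pd

df = pd.DataFrame({
    'frag': ['TFDEHNAPNSNSNK', 'EPGANAIGMVAFK', 'GTIK',
             'SPWPSMAR', 'LPAK', 'NEDSFVVWEQIINSLSALK'],
    'mass': [1573.675712, 1573.675700, 417.258734,
             417.258700, 427.279469, 2191.116099],
})

ppm = .00001
df = df.sort_values('mass')
df['pdiff'] = (df.mass - df.mass.shift()) / df.mass

# repeat the last row so the shifted mask below also covers it,
# and back-fill the leading NaN in pdiff
df = df.iloc[list(range(len(df))) + [-1]].reset_index(drop=True).bfill()

# a "good" row is far (in relative terms) from both of its neighbors
good = df[(df['pdiff'] > ppm) & (df['pdiff'].shift(-1) > ppm)]
print(good.frag.tolist())
```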

+2




Here's something slightly different from what you asked, but it's very simple and I think it gives a similar effect.

Using numpy.round, you can create a new column:

import numpy as np

df['roundedMass'] = np.round(df.mass, 6)




Then you can group by the rounded mass and use nunique to count the distinct fragments in each group, and filter for the groups of size 1.

So, the number of distinct fragments per rounded mass:

df.frag.groupby(np.round(df.mass, 6)).nunique()
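To actually keep only the fragments whose rounded mass occurs once, a sketch along these lines (rounding to 2 decimals here so the near-identical toy masses above actually collide; the right precision depends on your ppm tolerance):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'frag': ['TFDEHNAPNSNSNK', 'EPGANAIGMVAFK', 'GTIK',
             'SPWPSMAR', 'LPAK', 'NEDSFVVWEQIINSLSALK'],
    'mass': [1573.675712, 1573.675700, 417.258734,
             417.258700, 427.279469, 2191.116099],
})

rounded = np.round(df.mass, 2)
# size of each rounded-mass group, aligned with the original rows
counts = df.mass.groupby(rounded).transform('size')
unique = df[counts == 1]
print(unique.frag.tolist())
```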


+1




Another solution might be to make a duplicate of your list (if you need to keep the original for further processing later), iterate over it, and remove every element whose mass fails your rule against some other element (i.e., for which are_dif returns False).

You will be left with a new list containing only the fragments with unique masses.

Just don't forget that if you need to use the original list later, you will need to use deepcopy.
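A minimal sketch of that idea on a plain list of (frag, mass) tuples, re-using the are_dif rule from the question (building a new filtered list rather than removing in place):

```python
import copy

def are_dif(m1, m2, ppm=10):
    # the rule from the question: True if the masses differ by more than ppm parts per million
    return abs((m1 - m2) / m1) >= ppm * 0.000001

frags = [('TFDEHNAPNSNSNK', 1573.675712), ('EPGANAIGMVAFK', 1573.675700),
         ('GTIK', 417.258734), ('SPWPSMAR', 417.258700),
         ('LPAK', 427.279469), ('NEDSFVVWEQIINSLSALK', 2191.116099)]

work = copy.deepcopy(frags)  # deepcopy, so the original list stays usable later
# O(n^2): keep an element only if it is "different" from every other element
unique = [t for i, t in enumerate(work)
          if all(are_dif(t[1], u[1]) for j, u in enumerate(work) if j != i)]
```

This is quadratic in the number of fragments, so for thousands of rows the sorted pandas approach in the first answer will be much faster.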

0



