+ Reply to Thread
Results 1 to 5 of 5

Using "mirror image" of cells to delete duplicate data?

  1. #1
    Registered User
    Join Date
    03-11-2006
    Posts
    11

    Question Using "mirror image" of cells to delete duplicate data?

    Hello: I have a plain text data file that contains rows of data - each with 3 columns separated by spaces - that is a dataset for a visualization program (Cytoscape).

    The data is of the form

    a pi b
    a pi c
    a pi d
    c pi a
    etc.

    where the first and last columns are genes, and the middle column is the relationship (interaction type) between them.

    My problem is that the rows (a pi c) and (c pi a) are informationally equivalent (gene "a" interacts with gene "c" with interaction type "pi"), but both are mapped (displayed) in Cytoscape, adding superfluous redundancy to the visual images, e.g.

    a=c (with two lines connecting genes a and c)

    and not

    a-c (with 1 line connecting genes a and c).

    I would like to parse (delete) these duplicate data.

    I imported (copied/pasted) the sorted data from Notepad into Excel (into two Excel files - there are ~120,000 rows, which exceeds Excel's maximum number of rows by ~2X), but I cannot figure out how to delete these duplicates.

    I thought that if I reversed the order of the cells - the "reverse complement," i.e. (a pi b) converted to (b pi a) then combined the starting and transposed list, that I could sort out the duplicates (which I am able to do), but this action also generates new duplicates, a "vicious cycle."

    What I really need to have, either in Excel (or a linux script?), is a script that interprets these actions:

    for each row of data in the data file
    if (reverse complement) = true then delete

    That is, look at the list

    a pi b
    a pi c
    a pi d
    c pi a

    if (b pi a) is present, then delete (a pi b)

    and so on. In this example, when the last row is evaluated,

    if (a pi c) is present, then delete (c pi a)

    this "duplicate" (redundant) entry will be deleted.

    Is there a simple way to do (code) this in Excel? And / or a linux script?

    I apologize for the length of this message - I wanted to clearly explain the problem. If someone could provide a solution, I would really appreciate it! Sincerely, Greg S. :-)

  2. #2
    Forum Contributor kraljb's Avatar
    Join Date
    05-26-2004
    Location
    Illinois
    MS-Off Ver
    2007 (recent change)
    Posts
    256
    =IF(OR(SUMPRODUCT(--($A$1:$A1=$A1),--($B$1:$B1=$B1),--($C$1:$C1=$C1))>1,SUMPRODUCT(--($A$1:$A1=$C1),--($B$1:$B1=$B1),--($C$1:$C1=$A1))),"DUPLICATE","")

    Copy this formula in column D (assuming your information is in Columns A-C) in Column E put =ROW()

    Copy and Paste as Values Columns D and E

    Sort Column D (to put all the DUPLICATE's together) Delete them.

    Sort Column E so you have your original order back minus duplicates

    Delete Column D and E and no more duplicates.

    Hope that helps

    -John

  3. #3
    Forum Expert mrice's Avatar
    Join Date
    06-22-2004
    Location
    Surrey, England
    MS-Off Ver
    Excel 2013
    Posts
    4,967
    Here's one way you might do it...

    Get a unique list of the different genes (presumably 100's).

    Assign each one a different prime number

    http://britton.disted.camosun.bc.ca/jbprimelist.htm

    Use the vlookup function to line up the prime numbers against the pairs

    e.g. a pi b might be 3 pi 93

    Use a formula to calculate the product.

    Sort by this product - only mirror images should have the same product.

    Use the exact function on adjacent rows to look for matches.

    Delete all rows which are identical to the one above.

    You may need to do this independently on the two spreadsheets to get the numbers down and then combine them once they fit onto a single sheet.

    Hope this makes sense.
    Martin

  4. #4
    Registered User
    Join Date
    03-11-2006
    Posts
    11

    Smile Thank you John and Martin - Works!

    Dear John ("kraljb") and Martin ("mrice"): Thank you very much for your *very quick* replies, that I very much appreciated! I have been very busy, and I just got back to this problem. I thought your answer, Martin, was very clever, but I tried John's method, that appears to work brilliantly!

    Thank you both, again, for your kind help!

    Sincerely, Greg ("abcd1234")

    P.S. Here is a "trial run," using John's method, cut and pasted from MS Excel (unfortunately, the tabs are converted to spaces, below, but you get the idea):

    [Columns:]
    A B C D E F G H

    [Source data:]
    d pi c 1 1
    g pi f 2 2
    a pi b 3 3
    b pi c 4 4
    c pi d DUPLICATE 5 DUPLICATE 5
    e pi f 6 6
    f pi g DUPLICATE 7 DUPLICATE 7
    DUPLICATE 8 DUPLICATE 8
    d gi c 9 9
    g tf f 10 10
    b pi a DUPLICATE 11 DUPLICATE 11
    b gi c 12 12
    c tf d 13 13
    e gi f 14 14
    f tf g DUPLICATE 15 DUPLICATE 15
    DUPLICATE 16 DUPLICATE 16
    d pi c DUPLICATE 17 DUPLICATE 17
    g pi f DUPLICATE 18 DUPLICATE 18
    h pi j 19 19
    i pi k 20 20
    j pi l 21 21

    with:

    Cell types = General

    Column D formula: =IF(OR(SUMPRODUCT(--($A$1:$A15=$A15),--($B$1:$B15=$B15),--($C$1:$C15=$C15))>1,SUMPRODUCT(--($A$1:$A15=$C15),--($B$1:$B15=$B15),--($C$1:$C15=$A15))),"DUPLICATE","")

    Column E Formula: =ROW()-2
    (I subtracted two, as I had two non-data rows at the top of my file.)

    Columns (G,H) = Columns(D,E) copied and pasted as values: (Ctrl-E, S, V)

    Rows above copied, pasted (here) as values (to illustrate the method):

    d pi c 1 1
    g pi f 2 2
    a pi b 3 3
    b pi c 4 4
    c pi d DUPLICATE 5 DUPLICATE 5
    e pi f 6 6
    f pi g DUPLICATE 7 DUPLICATE 7
    DUPLICATE 8 DUPLICATE 8
    d gi c 9 9
    g tf f 10 10
    b pi a DUPLICATE 11 DUPLICATE 11
    b gi c 12 12
    c tf d 13 13
    e gi f 14 14
    f tf g DUPLICATE 15 DUPLICATE 15
    DUPLICATE 16 DUPLICATE 16
    d pi c DUPLICATE 17 DUPLICATE 17
    g pi f DUPLICATE 18 DUPLICATE 18
    h pi j 19 19
    i pi k 20 20
    j pi l 21 21


    Sorted (descending) by Column G:

    c pi d DUPLICATE 5 DUPLICATE 5
    f pi g DUPLICATE 7 DUPLICATE 7
    DUPLICATE 8 DUPLICATE 8
    b pi a DUPLICATE 11 DUPLICATE 11
    f tf g DUPLICATE 15 DUPLICATE 15
    DUPLICATE 16 DUPLICATE 16
    d pi c DUPLICATE 17 DUPLICATE 17
    g pi f DUPLICATE 18 DUPLICATE 18
    d pi c 1 1
    g pi f 2 2
    a pi b 3 3
    b pi c 4 4
    e pi f 6 6
    d gi c 9 9
    g tf f 10 10
    b gi c 12 12
    c tf d 13 13
    e gi f 14 14
    h pi j 19 19
    i pi k 20 20
    j pi l 21 21


    DUPLICATES removed, then sorted (ascending) by Column H (giving the original data, minus duplicates! - all of the columns, except (A,B,C), can now be deleted):

    d pi c 1 1
    g pi f 2 2
    a pi b 3 3
    b pi c 4 4
    e pi f 6 6
    d gi c 9 9
    g tf f 10 10
    b gi c 12 12
    c tf d 13 13
    e gi f 14 14
    h pi j 19 19
    i pi k 20 20
    j pi l 21 21

  5. #5
    Registered User
    Join Date
    03-11-2006
    Posts
    11
    CORRECTION: In my 'thank you' reply I made an error in the quoted formula [copied from an Excel sheet 'comment' (non-data) cell, which had altered the formula:

    Incorrect Column D formula:

    =IF(OR(SUMPRODUCT(--($A$1:$A15=$A15),--($B$1:$B15=$B15),--($C$1:$C15=$C15))>1,SUMPRODUCT(--($A$1:$A15=$C15),--($B$1:$B15=$B15),--($C$1:$C15=$A15))),"DUPLICATE","")


    The correct formula - as originally provided by John - is:

    =IF(OR(SUMPRODUCT(--($A$1:$A1=$A1),--($B$1:$B1=$B1),--($C$1:$C1=$C1))>1,SUMPRODUCT(--($A$1:$A1=$C1),--($B$1:$B1=$B1),--($C$1:$C1=$A1))),"DUPLICATE","")

    I discovered this error when double-checking some of my "mappings" before/after parsing the duplicates - the data pairs were "messed up." Thanks again, Greg :-)

+ Reply to Thread

Thread Information

Users Browsing this Thread

There are currently 1 users browsing this thread. (0 members and 1 guests)

Bookmarks

Posting Permissions

  • You may not post new threads
  • You may not post replies
  • You may not post attachments
  • You may not edit your posts

Search Engine Friendly URLs by vBSEO 3.6.0 RC 1