Using "mirror image" of cells to delete duplicate data?

**abcd1234** · 05-31-2007, 01:46 PM

Hello: I have a plain text data file that contains rows of data - each with 3 columns separated by spaces - that is a dataset for a visualization program (Cytoscape).

The data is of the form

a pi b
a pi c
a pi d
c pi a
etc.

where the first and last columns are genes, and the middle column is the relationship (interaction type) between them.

My problem is that the rows (a pi c) and (c pi a) are informationally equivalent (gene "a" interacts with gene "c" with interaction type "pi"), but both are mapped (displayed) in Cytoscape, adding superfluous redundancy to the visual images, e.g.

a=c (with two lines connecting genes a and c)

and not

a-c (with 1 line connecting genes a and c).

I would like to parse (delete) these duplicate data.

I imported (copied/pasted) the sorted data from Notepad into Excel (into two Excel files - there are ~120,000 rows, which exceeds Excel's maximum number of rows by ~2X), but I cannot figure out how to delete these duplicates.

I thought that if I reversed the order of the cells - the "reverse complement," i.e. (a pi b) converted to (b pi a) then combined the starting and transposed list, that I could sort out the duplicates (which I am able to do), but this action also generates new duplicates, a "vicious cycle."

What I really need to have, either in Excel (or a linux script?), is a script that interprets these actions:

for each row of data in the data file
if (reverse complement) = true then delete

That is, look at the list

a pi b
a pi c
a pi d
c pi a

if (b pi a) is present, then delete (a pi b)

and so on. In this example, when the last row is evaluated,

if (a pi c) is present, then delete (c pi a)

this "duplicate" (redundant) entry will be deleted.

Is there a simple way to do (code) this in Excel? And / or a linux script?

I apologize for the length of this message - I wanted to clearly explain the problem. If someone could provide a solution, I would really appreciate it! Sincerely, Greg S. :-)

**kraljb** · 05-31-2007, 02:43 PM

=IF(OR(SUMPRODUCT(--($A$1:$A1=$A1),--($B$1:$B1=$B1),--($C$1:$C1=$C1))>1,SUMPRODUCT(--($A$1:$A1=$C1),--($B$1:$B1=$B1),--($C$1:$C1=$A1))),"DUPLICATE","")

Copy this formula in column D (assuming your information is in Columns A-C) in Column E put =ROW()

Copy and Paste as Values Columns D and E

Sort Column D (to put all the DUPLICATE's together) Delete them.

Sort Column E so you have your original order back minus duplicates

Delete Column D and E and no more duplicates.

Hope that helps

-John

**mrice** · 05-31-2007, 02:51 PM

Here's one way you might do it...

Get a unique list of the different genes (presumably 100's).

Assign each one a different prime number

http://britton.disted.camosun.bc.ca/jbprimelist.htm

Use the vlookup function to line up the prime numbers against the pairs

e.g. a pi b might be 3 pi 93

Use a formula to calculate the product.

Sort by this product - only mirror images should have the same product.

Use the exact function on adjacent rows to look for matches.

Delete all rows which are identical to the one above.

You may need to do this independently on the two spreadsheets to get the numbers down and then combine them once they fit onto a single sheet.

Hope this makes sense.

**abcd1234** · 06-05-2007, 02:26 PM

Dear John ("kraljb") and Martin ("mrice"): Thank you very much for your *very quick* replies, that I very much appreciated! I have been very busy, and I just got back to this problem. I thought your answer, Martin, was very clever, but I tried John's method, that appears to work brilliantly!

Thank you both, again, for your kind help!

Sincerely, Greg ("abcd1234")

P.S. Here is a "trial run," using John's method, cut and pasted from MS Excel (unfortunately, the tabs are converted to spaces, below, but you get the idea):

[Columns:]
A B C D E F G H

[Source data:]
d pi c 1 1
g pi f 2 2
a pi b 3 3
b pi c 4 4
c pi d DUPLICATE 5 DUPLICATE 5
e pi f 6 6
f pi g DUPLICATE 7 DUPLICATE 7
DUPLICATE 8 DUPLICATE 8
d gi c 9 9
g tf f 10 10
b pi a DUPLICATE 11 DUPLICATE 11
b gi c 12 12
c tf d 13 13
e gi f 14 14
f tf g DUPLICATE 15 DUPLICATE 15
DUPLICATE 16 DUPLICATE 16
d pi c DUPLICATE 17 DUPLICATE 17
g pi f DUPLICATE 18 DUPLICATE 18
h pi j 19 19
i pi k 20 20
j pi l 21 21

with:

Cell types = General

Column D formula: =IF(OR(SUMPRODUCT(--($A$1:$A15=$A15),--($B$1:$B15=$B15),--($C$1:$C15=$C15))>1,SUMPRODUCT(--($A$1:$A15=$C15),--($B$1:$B15=$B15),--($C$1:$C15=$A15))),"DUPLICATE","")

Column E Formula: =ROW()-2
(I subtracted two, as I had two non-data rows at the top of my file.)

Columns (G,H) = Columns(D,E) copied and pasted as values: (Ctrl-E, S, V)

Rows above copied, pasted (here) as values (to illustrate the method):

d pi c 1 1
g pi f 2 2
a pi b 3 3
b pi c 4 4
c pi d DUPLICATE 5 DUPLICATE 5
e pi f 6 6
f pi g DUPLICATE 7 DUPLICATE 7
DUPLICATE 8 DUPLICATE 8
d gi c 9 9
g tf f 10 10
b pi a DUPLICATE 11 DUPLICATE 11
b gi c 12 12
c tf d 13 13
e gi f 14 14
f tf g DUPLICATE 15 DUPLICATE 15
DUPLICATE 16 DUPLICATE 16
d pi c DUPLICATE 17 DUPLICATE 17
g pi f DUPLICATE 18 DUPLICATE 18
h pi j 19 19
i pi k 20 20
j pi l 21 21

Sorted (descending) by Column G:

c pi d DUPLICATE 5 DUPLICATE 5
f pi g DUPLICATE 7 DUPLICATE 7
DUPLICATE 8 DUPLICATE 8
b pi a DUPLICATE 11 DUPLICATE 11
f tf g DUPLICATE 15 DUPLICATE 15
DUPLICATE 16 DUPLICATE 16
d pi c DUPLICATE 17 DUPLICATE 17
g pi f DUPLICATE 18 DUPLICATE 18
d pi c 1 1
g pi f 2 2
a pi b 3 3
b pi c 4 4
e pi f 6 6
d gi c 9 9
g tf f 10 10
b gi c 12 12
c tf d 13 13
e gi f 14 14
h pi j 19 19
i pi k 20 20
j pi l 21 21

DUPLICATES removed, then sorted (ascending) by Column H (giving the original data, minus duplicates! - all of the columns, except (A,B,C), can now be deleted):

d pi c 1 1
g pi f 2 2
a pi b 3 3
b pi c 4 4
e pi f 6 6
d gi c 9 9
g tf f 10 10
b gi c 12 12
c tf d 13 13
e gi f 14 14
h pi j 19 19
i pi k 20 20
j pi l 21 21

**abcd1234** · 06-11-2007, 09:12 AM

CORRECTION: In my 'thank you' reply I made an error in the quoted formula [copied from an Excel sheet 'comment' (non-data) cell, which had altered the formula:

Incorrect Column D formula:

=IF(OR(SUMPRODUCT(--($A$1:$A15=$A15),--($B$1:$B15=$B15),--($C$1:$C15=$C15))>1,SUMPRODUCT(--($A$1:$A15=$C15),--($B$1:$B15=$B15),--($C$1:$C15=$A15))),"DUPLICATE","")

The correct formula - as originally provided by John - is:

=IF(OR(SUMPRODUCT(--($A$1:$A1=$A1),--($B$1:$B1=$B1),--($C$1:$C1=$C1))>1,SUMPRODUCT(--($A$1:$A1=$C1),--($B$1:$B1=$B1),--($C$1:$C1=$A1))),"DUPLICATE","")

I discovered this error when double-checking some of my "mappings" before/after parsing the duplicates - the data pairs were "messed up." Thanks again, Greg :-)

Using "mirror image" of cells to delete duplicate data?

LinkBack

Thread Tools

Rate This Thread

Display

Using "mirror image" of cells to delete duplicate data?

Thank you John and Martin - Works!

Thread Information

Users Browsing this Thread

Bookmarks

Bookmarks

Posting Permissions