I would like to reverse a dataframe with dummy variables. For example,

From df_input:

Course_01  Course_02  Course_03
        0          0          1
        1          0          0
        0          1          0

to df_output:

  Course
0     03
1     01
2     02

I have looked at the solution given in "Reconstruct a categorical variable from dummies in pandas", but it did not work for me. Any help would be much appreciated.

Many Thanks, Best Regards, Carlo

3 Answers

Suppose you have the following dummy DF:

In [152]: d
Out[152]:
    id  T_30  T_40  T_50
0  id1     0     1     1
1  id2     1     0     1

We can prepare the following helper Series, which maps each dummy column to the number in its name:

In [153]: v = pd.Series(d.columns.drop('id').str.replace(r'\D', '', regex=True).astype(int), index=d.columns.drop('id'))

In [155]: v
Out[155]:
T_30    30
T_40    40
T_50    50
dtype: int64

Now we can multiply them, stack and filter:

In [154]: d.set_index('id').mul(v).stack().reset_index(name='T').drop(columns='level_1').query("T > 0")
Out[154]:
    id   T
1  id1  40
2  id1  50
3  id2  30
5  id2  50
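
For readers who want to run the steps above end to end, here is a minimal self-contained sketch. The sample frame is rebuilt by hand from the output shown above, and the names dummy_cols and result are only illustrative. It assumes a reasonably recent pandas, where str.replace needs regex=True and drop takes columns=.

```python
import pandas as pd

# Sample data rebuilt by hand from the dummy frame shown above
d = pd.DataFrame({
    "id": ["id1", "id2"],
    "T_30": [0, 1],
    "T_40": [1, 0],
    "T_50": [1, 1],
})

dummy_cols = d.columns.drop("id")

# Map each dummy column name to the integer encoded in it, e.g. "T_30" -> 30
v = pd.Series(
    dummy_cols.str.replace(r"\D", "", regex=True).astype(int),
    index=dummy_cols,
)

result = (
    d.set_index("id")
     .mul(v)                    # every 1 becomes its column's number, 0 stays 0
     .stack()                   # wide -> long: one row per (id, column) pair
     .reset_index(name="T")
     .drop(columns="level_1")   # the original column label is no longer needed
     .query("T > 0")            # keep only the categories that were set
)
print(result)
```

Because a row may have several dummies set, this form returns one output row per set dummy rather than one per input row.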

We can use wide_to_long and then keep only the rows whose value is not zero, i.e.

ndf = pd.wide_to_long(df, stubnames='T_', i='id',j='T')

      T_
id  T     
id1 30   0
id2 30   1
id1 40   1
id2 40   0

not_dummy = ndf[ndf['T_'].ne(0)].reset_index().drop(columns='T_')

   id   T
0  id2  30
1  id1  40

Update based on your edit (no id column, so the row index is used instead):

ndf = pd.wide_to_long(df.reset_index(), stubnames='T_', i='index', j='T')

not_dummy = ndf[ndf['T_'].ne(0)].reset_index(level='T').drop(columns='T_')

        T
index    
1      30
0      40
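
As a runnable sketch of both variants above, assuming a two-column sample frame rebuilt from the printed output (the names by_id and by_row are only illustrative):

```python
import pandas as pd

# Sample data rebuilt by hand from the wide_to_long output shown above
df = pd.DataFrame({"id": ["id1", "id2"], "T_30": [0, 1], "T_40": [1, 0]})

# Variant 1: an id column identifies each row
ndf = pd.wide_to_long(df, stubnames="T_", i="id", j="T")
by_id = ndf[ndf["T_"].ne(0)].reset_index().drop(columns="T_")
print(by_id)

# Variant 2: no id column, so the positional row index is used as the key
no_id = df.drop(columns="id").reset_index()
ndf2 = pd.wide_to_long(no_id, stubnames="T_", i="index", j="T")
by_row = ndf2[ndf2["T_"].ne(0)].reset_index(level="T").drop(columns="T_")
print(by_row)
```

The filter on T_ is the step that matters: without it, every (row, column) pair is returned, set or not.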

Thanks Dark. Unfortunately, when I apply your solution to my data, it generates more rows than it should. In particular, it creates <id2,30>, <id2,40> and <id1,30>, <id1,40>, which is not the expected result. – Carlo Allocca

@CarloAllocca It is much better to include a sample of the actual data in cases like this. – Dark

@CarloAllocca I only kept the rows not equal to 0; maybe you are not running that step. – Dark

wide_to_long is not a very often used function, plus1 ;) – jezrael

What's happening in drop('T_',1)? – pyd

That's an additional column created by wide_to_long because of stubnames='T_', so I'm dropping it at the end. – Dark

You can use:

# move id into the index if necessary
df = df.set_index('id')
# split the column names into a MultiIndex
df.columns = df.columns.str.split('_', expand=True)
# reshape with stack and remove the rows that are 0
df = df.stack().reset_index().query('T != 0').drop(columns='T').rename(columns={'level_1':'T'})
print (df)
    id   T
1  id1  40
2  id2  30

EDIT:

import numpy as np

col_name = 'Course'
df.columns = df.columns.str.split('_', expand=True)
df = (df.replace(0, np.nan)
        .stack()
        .dropna()  # no-op on older pandas; newer stack() may keep NaN rows
        .reset_index()
        .drop(columns=[col_name, 'level_0'])
        .rename(columns={'level_1': col_name})
)
print (df)
  Course
0     03
1     01
2     02
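
The comments on this answer trace the extra rows (377 x 129) to dummy values stored as the string '0' rather than the number 0, so nothing was filtered out. Below is a hedged sketch of the same split-and-stack idea that normalizes the values first. The helper name undummy is hypothetical, and it assumes every cell holds 0/1 or '0'/'1'.

```python
import pandas as pd

# Sample data taken from the question
df = pd.DataFrame({
    "Course_01": [0, 1, 0],
    "Course_02": [0, 0, 1],
    "Course_03": [1, 0, 0],
})

def undummy(frame, col_name="Course"):
    """Hypothetical helper: return one row per set dummy, holding its suffix."""
    out = frame.astype(int)              # "0"/"1" strings become 0/1
    out.columns = out.columns.str.split("_", expand=True)
    long = out.stack().reset_index()     # columns: level_0, level_1, col_name
    long = long[long[col_name] != 0]     # explicit filter instead of NaN dropping
    return (
        long.sort_values("level_0", kind="stable")   # keep original row order
            .drop(columns=[col_name, "level_0"])
            .rename(columns={"level_1": col_name})
            .reset_index(drop=True)
    )

print(undummy(df))
print(undummy(df.astype(str)))   # string dummies give the same result
```

Filtering explicitly on != 0 after astype(int) avoids relying on stack() to drop NaN rows, which is the step that silently fails when the zeros are strings.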

Thanks Jezrael. As Dark said, I think I am providing the wrong data. I am updating the description above with the actual data. – Carlo Allocca

@CarloAllocca Is the data that big? Post the data and the expected output later; I have to go now, so let me update the answer and go. – Dark

Thanks Dark. No, it is not that big, it is just a sample. Many thanks. – Carlo Allocca

@CarloAllocca Got to go, check the edit, hope it helps. If not, the other answerers might post a better one. All you need to remove from these answers is set_index(). – Dark

Thanks Dark and Jezrael. I am sure that your solutions are correct, but I don't know what I am doing wrong: applied to my data, they do not give the right result. I decided to publish a sample of my data. – Carlo Allocca

Every solution is generating more rows than it should. – Carlo Allocca

The answer was edited ;) – jezrael

Thanks Jezrael. The issue is still there. Basically, I have a dataset of 377 rows and Course has 129 values. When I apply your script, I get a new dataset of 48761 rows, i.e. 377 x 129. What am I doing wrong? – Carlo Allocca

@CarloAllocca - maybe the values are strings; in that case use df.replace('0', np.nan) instead of df.replace(0, np.nan), i.e. put quotes around the 0. – jezrael

YESSS. My mistake. Thank you very much Jezrael. – Carlo Allocca

Hmmm, only one answer should be accepted ;) – jezrael
