Comments on: How to One Hot Encode Categorical Variables of a Large Dataset in Python? https://yashuseth.blog/2017/12/14/how-to-one-hot-encode-categorical-variables-of-a-large-dataset-in-python/ A blog on data science, machine learning and artificial intelligence. Mon, 26 Nov 2018 14:18:17 +0000 hourly 1 http://wordpress.com/ By: Yashu Seth https://yashuseth.blog/2017/12/14/how-to-one-hot-encode-categorical-variables-of-a-large-dataset-in-python/comment-page-1/#comment-93 Mon, 26 Nov 2018 14:18:17 +0000 http://yashuseth.wordpress.com/?p=85#comment-93 “doesn’t this trigger information leaking?”
Yes this can cause information leaking. But, we generally work under the assumption that we will not get new levels in our test (real world) data. When we split our data into train and validation this can cause some levels to be present in only one of the splits. This is happening because we made a split manually and not because our general (real world) data has this property. Hence, we can fit the encoder on both train and validation data. (By fitting on the test data I mean the manually created validation data). And anyways we will train our model on both (train and validation) the data before predicting on the test (real world) data, hence this can be convenient.

“What is the best way to approach this problem when fitting a model in a real world situation, where the test set is unseen?”
In a real world scenario, this is anyways out of our control. Since there might be new levels of a feature that we have never trained on. Hence, there is no other option but to ignore those levels. The current version of the library will give you all zeros for such cases.

Like

]]>
By: Allen https://yashuseth.blog/2017/12/14/how-to-one-hot-encode-categorical-variables-of-a-large-dataset-in-python/comment-page-1/#comment-84 Mon, 12 Nov 2018 21:41:58 +0000 http://yashuseth.wordpress.com/?p=85#comment-84 Hi Yashu, thanks sharing your amazing work. This is really helpful. I have a question on fitting the encoder on test set:
doesn’t this trigger information leaking?
What is the best way to approach this problem when fitting a model in a real world situation, where the test set is unseen?
Many thanks and looking forward for your reply
Allen

Like

]]>
By: Understanding Time Series Modelling and Forecasting – Part 1 – Let the Machines Learn https://yashuseth.blog/2017/12/14/how-to-one-hot-encode-categorical-variables-of-a-large-dataset-in-python/comment-page-1/#comment-6 Thu, 18 Jan 2018 19:49:32 +0000 http://yashuseth.wordpress.com/?p=85#comment-6 […] How to one hot encode categorical variables of a large dataset in Python? […]

Like

]]>