Greetings!

Six weeks of hard work have passed since my last post, and now, right before the second evaluation, I am happy to share my latest results.

TL;DR: I’ve managed to optimize Word2Vec and achieved fully linear scaling using a multistream approach. It is now 3x faster than the current Word2Vec in gensim/develop and 2x faster than Mikolov’s original word2vec implementation. See the numbers:

gensim/develop version
Mikolov’s version
New multistream training optimized with Cython

I’ve also optimized vocabulary building using the multiprocessing module and multistream input. See my pull request.
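To give a rough idea of what multistream vocabulary building can look like, here is a minimal sketch. It is not the code from my pull request: the function names (count_words_in_stream, merge_counts, build_vocab_multistream) are made up for illustration, and it assumes that each stream is a plain-text file with whitespace-separated tokens. Each file is counted in its own process, and the partial counts are merged at the end.

```python
from collections import Counter
from multiprocessing import Pool


def count_words_in_stream(path):
    """Count whitespace-separated tokens in one input file (one "stream")."""
    counts = Counter()
    with open(path, encoding="utf-8") as f:
        for line in f:
            counts.update(line.split())
    return counts


def merge_counts(partial_counts):
    """Sum the per-stream counters into a single vocabulary counter."""
    total = Counter()
    for counts in partial_counts:
        total.update(counts)
    return total


def build_vocab_multistream(paths, processes=None):
    """Count every file in a separate worker process, then merge the results.

    Hypothetical helper for illustration only; the worker function must live
    at module level so that multiprocessing can pickle it.
    """
    with Pool(processes=processes) as pool:
        partial_counts = pool.map(count_words_in_stream, paths)
    return merge_counts(partial_counts)
```

With a corpus split into several files, build_vocab_multistream would be called with the list of file paths from under an `if __name__ == "__main__":` guard, which multiprocessing needs on platforms that spawn new processes. The counting step is independent for each stream, so only the final merge has to run in a single process.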

Plan for the last month

There is still a lot of work to do in the last month to get my feature ready for merging into the develop branch.

See you in the next blog post! Feel free to reach me via Telegram @persiyanov or email dmitry dot persiyanov at gmail dot com.