Greetings!

As you know from my previous blogpost, I am working on the Gensim library as a GSoC student this summer. In this post, I’d like to share my results, discoveries and failures during Week 1 & Week 2. At the end, I will describe my plan for the next weeks (which, unfortunately, no longer aligns with the original proposal timeline).

Intro

Initially, the problem I’m trying to solve sounded like this: the threads that train the model (workers) are highly optimized and therefore get blocked on the Queue.get call here, waiting until the data producer threads (producers) fill the queue here.

In order to fix this, a multistream API was proposed. The main idea is to read data in many threads instead of a single one (running several job producers in parallel on different input streams). The user could then split a large data file into several parts and pass them as input streams to the model.
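
From the user’s point of view, the intended usage could look roughly like the sketch below. This is just an illustration of the idea: the exact multistream entry point is still being designed, and `input_streams` is an assumed parameter name, not the final Gensim API.

```python
# Sketch of the intended multistream usage (illustrative only).
from gensim.models.word2vec import LineSentence

# The user pre-splits one large corpus file into several parts,
# e.g. corpus_part_0.txt ... corpus_part_3.txt, one stream per part.
streams = [LineSentence('corpus_part_%d.txt' % i) for i in range(4)]

# Each stream would then get its own job producer feeding the workers,
# e.g. something along the lines of (hypothetical API, not real yet):
# model = Word2Vec(input_streams=streams, workers=8, window=5)
```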

Single stream benchmarking

Note: you can find the full results in my pull request, but in this blog post I will only include the results for Word2Vec.

I started by benchmarking the performance of models with a single input stream.
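
For reference, the numbers in the table below come from runs along these lines (this is not the exact benchmark script; average queue size and CPU load required extra instrumentation that is not shown, and the corpus path is hypothetical):

```python
import time
from gensim.models.word2vec import Word2Vec, LineSentence

corpus_path = 'corpus.txt'   # hypothetical path to the benchmark corpus
# Raw word count, used only to get a rough words-per-second figure.
total_words = sum(len(sentence) for sentence in LineSentence(corpus_path))

for workers in (1, 4, 8, 10, 12, 14):
    start = time.time()
    Word2Vec(LineSentence(corpus_path), workers=workers, window=5)
    elapsed = time.time() - start
    print('%2d workers: %7.2f sec, ~%d raw words/sec'
          % (workers, elapsed, total_words / elapsed))
```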

| # workers | total time (sec) | avg queue size | processing speed (words per sec) | sum CPU load (%) |
|-----------|------------------|----------------|----------------------------------|------------------|
| 1         | 1182.58          | 1.48           | 153350                           | 104.35           |
| 4         | 313.03           | 7.09           | 579322                           | 395.23           |
| 8         | 277.35           | 0.025          | 653863                           | 456.55           |
| 10        | 275.24           | 0.040          | 658857                           | 456.85           |
| 12        | 285.95           | 0.024          | 634182                           | 449.99           |
| 14        | 288.64           | 0.017          | 628288                           | 452.17           |

Up to 4 workers, everything is okay:

  • Approximately linear speedup when going from 1 to 4 workers
  • ~4 CPUs are fully utilized

But increasing the number of workers to 8, 10, 12, or 14 reveals a worker starvation problem:

  • The average queue size is almost zero
  • Processing speed doesn’t increase linearly; it hits a plateau
  • CPUs are not fully utilized (each CPU runs at ~20-40%)

Multistream API and the main word2vec discovery

I implemented a first version of the multistream API (I just modified the BaseAny2VecModel._train_epoch method to this one). See the full results here.
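
Schematically, the change boils down to launching one job producer thread per input stream, all feeding the same shared job queue. The sketch below only illustrates this structure; `produce_jobs` and `worker_loop` are placeholders for Gensim’s internal batching and worker logic, not real API names.

```python
import threading
from queue import Queue

def train_epoch_multistream(input_streams, produce_jobs, worker_loop, num_workers):
    """One producer thread per input stream, all feeding a shared job queue."""
    job_queue = Queue(maxsize=2 * num_workers)

    producers = [threading.Thread(target=produce_jobs, args=(stream, job_queue))
                 for stream in input_streams]
    workers = [threading.Thread(target=worker_loop, args=(job_queue,))
               for _ in range(num_workers)]

    for t in producers + workers:
        t.daemon = True
        t.start()

    for t in producers:
        t.join()
    for _ in workers:
        job_queue.put(None)   # one stop sentinel per worker, after all producers finish
    for t in workers:
        t.join()
```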

Table for 1 input stream (single-stream case), window = 5:

| # workers | total time (sec) | avg queue size | processing speed (words per sec) | sum CPU load (%) |
|-----------|------------------|----------------|----------------------------------|------------------|
| 1         | 1030.84          | 1.409          | 175923                           | 105.96           |
| 4         | 278.43           | 6.974          | 651315                           | 400.33           |
| 8         | 255.39           | 0.019          | 710073                           | 453.48           |
| 10        | 256.34           | 0.023          | 707447                           | 450.06           |
| 12        | 255.86           | 0.023          | 708778                           | 453.84           |
| 14        | 258.83           | 0.023          | 700652                           | 450.32           |

Table for 2 input streams, window = 5:

| # workers | total time (sec) | avg queue size | processing speed (words per sec) | sum CPU load (%) |
|-----------|------------------|----------------|----------------------------------|------------------|
| 1         | 1047.49          | 1.393          | 172963                           | 106.34           |
| 4         | 298.42           | 6.477          | 606627                           | 390.23           |
| 8         | 294.28           | 0.158          | 615720                           | 400.45           |
| 10        | 301.32           | 0.107          | 599463                           | 396.61           |
| 12        | 299.65           | 0.108          | 602971                           | 396.26           |
| 14        | 304.75           | 0.099          | 591718                           | 391.84           |

It turns out that multistream is not as good as it was supposed to be. Moreover, in some cases multiple streams actually hurt performance compared to a single stream. This happens because Python doesn’t execute CPU-bound operations in parallel within a single process (the GIL prevents it). Therefore, if our _job_producer loop is more CPU-bound than IO-bound, the multistream optimization may not lead to a performance boost.

This shows that _job_producer is actually CPU-bound, not IO-bound. So, when we create many producers, the GIL leaves fewer resources for the _worker_loop threads and performance doesn’t increase.
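
Here is a tiny standalone illustration of that effect (nothing Gensim-specific): splitting pure-Python CPU-bound work across threads gives almost no parallel speedup, because only one thread executes Python bytecode at a time.

```python
import threading
import time

def burn(n=5 * 10**6):
    # Pure-Python CPU-bound work: holds the GIL the whole time.
    s = 0
    for i in range(n):
        s += i * i
    return s

def run_threads(num_threads):
    threads = [threading.Thread(target=burn) for _ in range(num_threads)]
    start = time.time()
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return time.time() - start

print('1 thread : %.2f sec' % run_threads(1))
# 4 threads do 4x the work; with real parallelism this would take about the
# same wall-clock time as 1 thread, but with the GIL it takes roughly 4x.
print('4 threads: %.2f sec' % run_threads(4))
```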

While fighting all these multithreading subtleties and resource contention issues, I decided to run an experiment with no job producers at all, only worker threads. I wanted to measure the maximum word2vec speed (no IO, only CPU model updates), which is an upper bound for my results in this project. To do this, I modified the code of _train_epoch to fill the queue with all the data first and only then start the worker threads (see the code here).
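
In spirit, the modification looks like the sketch below (a simplified illustration, not the actual patch; `build_jobs` and `worker_loop` stand in for Gensim’s internal batching and _worker_loop logic):

```python
import threading
from queue import Queue

def train_epoch_prefilled(data_iterator, build_jobs, worker_loop, num_workers):
    """Put the whole epoch into an in-memory queue first, then start workers."""
    job_queue = Queue()                  # unbounded: holds all jobs in RAM

    for job in build_jobs(data_iterator):
        job_queue.put(job)
    for _ in range(num_workers):
        job_queue.put(None)              # one stop sentinel per worker

    workers = [threading.Thread(target=worker_loop, args=(job_queue,))
               for _ in range(num_workers)]
    for t in workers:
        t.start()
    for t in workers:
        t.join()
```

Here are the results: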

| # workers | total time (sec) | processing speed (words per sec) | sum CPU load (%) |
|-----------|------------------|----------------------------------|------------------|
| 1         | 1047.96          | 173050                           | 101.11           |
| 4         | 261.16           | 694407                           | 386.01           |
| 8         | 197.32           | 919021                           | 545.11           |
| 10        | 197.67           | 917400                           | 536.40           |
| 12        | 201.49           | 900022                           | 544.90           |
| 14        | 203.63           | 890559                           | 546.92           |

What do these numbers tell us? Even pure word2vec training (no IO at all) can’t scale linearly beyond 4 threads. So the lack of linear speedup is not caused primarily by IO problems; it is mainly limited by word2vec’s own throughput. Nevertheless, word2vec with an in-memory queue gains up to +300k words per second, so there is still room for optimization.

Plan for Week 3 & Week 4

  1. I concluded that _job_producer is CPU-bound, so there is contention between producers and workers. I want to implement multistream using multiprocessing instead of multithreading (there is no GIL between processes in Python) and benchmark it; a rough sketch of the idea is shown after this list.
  2. Many users complain that the vocabulary building stage is slow (see this comment by Radim). I will parallelize it using multistream (likely with multiple processes rather than multiple threads, because vocab building is CPU-bound).
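
As a teaser for item 1, here is a rough sketch of what the multiprocessing variant might look like. This is an assumption about the design, not final code: `produce_jobs` and `worker_loop` are placeholders rather than Gensim APIs, and jobs would have to be picklable to cross process boundaries.

```python
import multiprocessing as mp
import threading

def train_epoch_multiprocess(input_streams, produce_jobs, worker_loop, num_workers):
    """Producers in separate processes, workers as threads in the main process."""
    # Producers no longer compete with workers for the main process's GIL,
    # but every job now has to be pickled to cross the process boundary.
    job_queue = mp.Queue(maxsize=100)

    producers = [mp.Process(target=produce_jobs, args=(stream, job_queue))
                 for stream in input_streams]
    workers = [threading.Thread(target=worker_loop, args=(job_queue,))
               for _ in range(num_workers)]

    for p in producers:
        p.start()
    for w in workers:
        w.start()

    for p in producers:
        p.join()
    for _ in workers:
        job_queue.put(None)   # one stop sentinel per worker, after producers finish
    for w in workers:
        w.join()
```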

So guys, stay tuned and come back in a few weeks! Feel free to reach me via Telegram @persiyanov or email dmitry at persiyanov dot gmail dot com.