Sir, why do we need to pass an options list for the self-made data, when we don't need any options list while using GoogleNews-vectors-negative300.bin?
If we can do the same without taking the options, how can we do that?
Also, sir, why do we need to take options if we already have the window size?
hey @abubakar_nsit ,
I hope your learning is going well.
Can you please let me know what options you are talking about?
Sir, in the word2vec mini-project tutorial, Prateek bhaiya makes a program for word analogy on his own data.
here is the link for the video:
and here is the link for the code:
Please tell me why word analogy is not working on the complete vocab, and why we need to take options as Prateek bhaiya has taken.
hey @abubakar_nsit ,
Prateek bhaiya has just taken them for learning purposes. It doesn't mean you can't use the whole vocabulary; he just wanted to show students an example of how, when we train our model on some data and query it, it shows results based on what it has learned. You can take the whole vocabulary and see what results you get.
Like in your code above, you see the result "played", which is something different from the query you tried.
In the same way, you can try different triads and see how the model performs.
I hope this resolves your doubt.
Thank You and Happy Learning .
No sir! I have tried many examples, but none of them give correct answers.
Sir, can you guide me on how the word2vec matrix of words is formed and on what basis similarity is found? And why is it not giving correct answers on the self-prepared data?
hey @abubakar_nsit ,
See, training a word2vec model on custom data requires a lot of data, seriously a lot of data.
If you take the example of GloVe embeddings, they are trained on all Wikipedia articles, and you can imagine how big that data is.
Your data is currently very, very small. You can easily get more data, but you will require computation power. There are deep learning methods like skip-gram and continuous bag of words (CBOW); you can learn the basics from here.
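To make the skip-gram idea concrete, here is a minimal sketch (my own illustration, not the tutorial's code) of the data-preparation step: every word is paired with the words that surround it inside the window, and these (center, context) pairs are what the model trains on.

```python
# Sketch: generate skip-gram (center, context) training pairs.
# The sentence below is a made-up example.

def skipgram_pairs(tokens, window=2):
    """Return (center, context) pairs for a tokenized sentence."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # skip pairing a word with itself
                pairs.append((center, tokens[j]))
    return pairs

sentence = "he was playing football in the ground".split()
for center, context in skipgram_pairs(sentence, window=2)[:5]:
    print(center, "->", context)
```

You can see from this why more data helps: every extra sentence multiplies the number of training pairs the model learns from.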
There are many ways to put it; check the above link, it will help you understand it a bit.
The basis of similarity is just the distance between two vectors. In mathematical terms, two vectors are identical, or similar to an extent, depending on the distance between them; it's your choice which distance metric you want to use. The vectors can be in any number of dimensions; the two vectors just need to be in the same dimensional space.
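For example, one common metric is cosine similarity, which compares the direction of two vectors. This sketch uses made-up toy vectors, not real word2vec output:

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine of the angle between u and v; 1.0 means the same direction."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 4.0, 6.0])   # same direction as a, so similarity is 1.0
c = np.array([-1.0, 0.0, 1.0])  # points elsewhere, so similarity is lower

print(cosine_similarity(a, b))
print(cosine_similarity(a, c))
```

Note that both vectors must live in the same dimensional space, exactly as described above, otherwise the dot product is not even defined.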
As shown in the video, Prateek bhaiya tried things beforehand and showed you some examples from this data that give good results, just for understanding. Also, Word2Vec has an inbuilt randomness in its data preparation, so even if you train with the same method and parameters as bhaiya did, there is still a chance that you will get different results. This is the reason why it is generally said that we should use pre-trained word vectors for such purposes: they provide better results and are also consistent.
Word2Vec doesn’t know what a word is or what it means. It just tries to learn from the way the word is placed within sentences and what words surround it. Based on these values, it generates a vector which represents that word in an n-dimensional space.
I hope this helped you.
thank you very much sir!!