Knowledge-based QA system not giving the most appropriate answer
I am working on a project that is mostly knowledge-based. My system takes a query from a user, downloads the relevant documents from Wikipedia, strips all HTML tags, and extracts the plain text. It then splits each document into sentences and builds the term-document (TD) matrix, treating every sentence as a document (the query is also added as a sentence). Stemming is applied while building the TD matrix. The TD matrix is then passed to the pLSA (Probabilistic Latent Semantic Analysis) algorithm. Finally, the system calculates the cosine similarity between each sentence vector and the query vector, and the sentence most similar to the query is displayed as the answer. The problem is that it returns a result, but not the most relevant one. Where am I going wrong? Is the strategy I am following the right one, or is there some other algorithm that can help? Below are some of the questions and the answers returned by my system:
What is photosynthesis?
ANSWER 1: The stroma contains stacks (grana) of thylakoids, which are the site of photosynthesis
ANSWER 2: Factors leaf is the primary site of photosynthesis in plants
ANSWER 3: Samuel Ruben and Martin Kamen used radioactive isotopes to determine that the oxygen liberated in photosynthesis came from the water
ANSWER 4: In plants, algae and cyanobacteria, photosynthesis releases oxygen
What is Artificial Intelligence?
ANSWER 1: the problem of creating 'artificial intelligence' will substantially be solved"
ANSWER 2: 37 The leading-edge definition of artificial intelligence research is changing over time
ANSWER 3: Stories of these creatures and their fates discuss many of the same hopes, fears and ethical concerns that are presented by artificial intelligence
ANSWER 4: History of artificial intelligence and Timeline of artificial intelligence Thinking machines and artificial beings appear in Greek myths, such as Talos of Crete, the bronze robot of Hephaestus, and Pygmalion Galatea 13 Human likenesses believed to have intelligence were built in every major civilization
Who is a hacker?
ANSWER 1: 19 Hackers (short stories) Helba from the
ANSWER 2: 16 Rafael Núñez aka RaFa was a notorious most wanted hacker by the FBI since 2001
ANSWER 3: Often, this type of 'white hat' hacker is called an ethical hacker
ANSWER 4: Hackers also commonly use port scanners
One more run:
What is biology?
ANSWER 1: Molecular biology is the study of biology at a molecular level
ANSWER 2: molecular biology studies the complex interactions of systems of biological molecules
ANSWER 3: The similarities and differences between cell types are particularly relevant to molecular biology
ANSWER 4: Contents History Foundations of modern biology 2
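For reference, the ranking step of the pipeline described above can be sketched roughly as follows. This is only a minimal illustration, not the actual system: it uses raw term counts, leaves out both stemming and the pLSA step, and all function names (`tokenize`, `build_td_matrix`, `rank_sentences`) are made up for the example. The candidate sentences are taken from the outputs listed above.

```python
import math
import re
from collections import Counter

def tokenize(text):
    # Lowercase and keep alphabetic tokens only (no stemming in this sketch).
    return re.findall(r"[a-z]+", text.lower())

def build_td_matrix(sentences):
    # Rows are terms, columns are sentences; entries are raw term counts.
    counts = [Counter(tokenize(s)) for s in sentences]
    vocab = sorted(set().union(*counts))
    matrix = [[c[term] for c in counts] for term in vocab]
    return vocab, matrix

def column(matrix, j):
    # Extract the vector for sentence j.
    return [row[j] for row in matrix]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def rank_sentences(query, sentences):
    # The query is appended as the last column, as described in the question.
    # Assumption: pLSA is skipped, so similarity is computed on raw counts.
    _, matrix = build_td_matrix(sentences + [query])
    q = column(matrix, len(sentences))
    scored = [(cosine(column(matrix, j), q), s) for j, s in enumerate(sentences)]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)

sentences = [
    "The stroma contains stacks (grana) of thylakoids, which are the site of photosynthesis",
    "In plants, algae and cyanobacteria, photosynthesis releases oxygen",
    "Hackers also commonly use port scanners",
]
ranked = rank_sentences("What is photosynthesis?", sentences)
for score, sentence in ranked:
    print(round(score, 3), sentence)
```

Even in this stripped-down form, any sentence that merely contains the query term gets a non-zero score, and the shorter of the two matching sentences is ranked first. Nothing in the scoring prefers a sentence that actually defines the term.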
I think it will be difficult to improve your system if you stick to a purely statistical approach. From a statistical NLP perspective, you are already doing the right thing. What you can still do is tune some parameters. To do this, you have to build a training corpus that tells the system which answer is correct for each question, and then see what value each parameter should take to give you that answer.
That being said, I don't think fine-tuning the parameters will improve your accuracy by more than 20-30%.
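As a rough illustration of that tuning loop, here is a minimal sketch. Everything in it is an assumption made for the example: `toy_answer` is a stand-in for the real system, `alpha` (a length-normalisation exponent) is a hypothetical parameter, and the second candidate sentence is a made-up gold answer, not output from the system in the question.

```python
import re

def tokens(text):
    return re.findall(r"[a-z]+", text.lower())

def toy_answer(question, candidates, alpha):
    # Hypothetical stand-in for the real system: count the tokens shared with
    # the question, then divide by sentence length raised to the power alpha.
    q = set(tokens(question))

    def score(sentence):
        toks = tokens(sentence)
        return sum(t in q for t in toks) / (len(toks) ** alpha)

    return max(candidates, key=score)

def accuracy(alpha, labelled):
    # Fraction of questions whose top-ranked sentence is the gold answer.
    hits = sum(toy_answer(q, cands, alpha) == gold for q, cands, gold in labelled)
    return hits / len(labelled)

def tune(alphas, labelled):
    # Exhaustive search: keep the value with the best accuracy on the corpus.
    return max(alphas, key=lambda a: accuracy(a, labelled))

# Tiny labelled corpus: (question, candidate sentences, gold answer).
labelled = [(
    "What is biology?",
    ["Molecular biology is the study of biology at a molecular level",
     "Biology is the study of life"],  # made-up gold sentence for illustration
    "Biology is the study of life",
)]

best = tune([0.0, 0.5, 1.0], labelled)
print(best, accuracy(best, labelled))
```

With a real system you would swap `toy_answer` for your own pipeline, use many more labelled questions, and measure accuracy on questions that were held out from the tuning.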
If you want to go further, you need a more semantic approach and represent knowledge symbolically. Check for example http://www.jfsowa.com/
This is a well-researched problem called Question Answering (QA). I have provided a summary of QA in another answer. In particular, all of your examples fall into the category of "definition questions" in TREC. I suggest searching for "TREC definition questions" on Google or Google Scholar and reading some of the resulting papers for ideas.