Answer Devices Blog


1000 Questions every mobile assistant should be able to answer

Posted by Admin in Answer Devices, 19 April 2013 · 1,122 views

In June 2012, Piper Jaffray analyst Gene Munster, a well-known Apple commentator and critic, tested Siri with a suite of 1600 questions.  He spoke the 1600 sample inputs in both a quiet environment and a busy city environment filled with background noise.  He reported that Siri comprehended 83% of queries in noisy conditions and 89% in a quiet room, but that even when Siri's voice recognition heard the input correctly, the device responded accurately only 62% of the time in the noisy setting and 68% in the quiet one.

This got us thinking: it would be nice to have a standard benchmark suite for testing mobile assistants and comparing the results.  Piper Jaffray's test was proprietary, and they were unable to release their test questions publicly.  We've assembled a list of 1000 sample inputs from various sources, including many virtual assistant conversations published in blogs and YouTube videos, and samples quoted from Apple TV commercials.  We've also added some anonymized data recorded with the CallMom app.  The result is a complete, open source suite of test inputs that can be used to grade the performance of a mobile virtual assistant.

Some of the benchmark samples are designed to test voice recognition accuracy.  For example, we've included "Call <somebody>" with several different names, some harder to recognize than others.  The benchmark covers 29 functional areas.  Here are the areas, with a sample query from each.

1. Alarm: "Wake me up in 40 minutes"
2. Apps: "Launch Angry Birds"
3. Browser: "Browse to New York Times"
4. Calendar: "Cancel Golf today"
5. Camera: "Take a picture"
6. Clock: "What time is it?"
7. Contacts: "Dad's home number is 2125551212"
8. Device: "What is my battery level?"
9. Dial: "Call Mom"
10. Email: "Email Albert the meeting has started"
11. Games: "Play tic tac toe"
12. Help: "Can you translate?"
13. Jokes: "Tell me a joke"
14. Knowledge: "What is the longest river in the world?"
15. Map: "Find the nearest Starbucks"
16. Movies & TV: "When is the Real Housewives on TV?"
17. Music: "Play some Coltrane"
18. Personality: "Do you ever get tired?"
19. Pictures: "Show me a picture of a horse"
20. Profile: "My birthday is on Valentine's day"
21. Search: "Address and phone number for the Best Western hotel in San Mateo, California"
22. Settings: "Can I change your name?"
23. Shopping: "I need a new jacket"
24. SMS: "Text Steven meet me at home"
25. Sports: "What is the score in the Spain-Italy game today?"
26. Stories: "Can you tell me a bedtime story"
27. Translation: "How do you say hello in French?"
28. Weather: "Do I need an umbrella today?"
29. Other: "Create a secure password"

The benchmark data is provided as a raw text file, with the inputs sorted alphabetically.  If you have a mobile virtual assistant, you may wish to run the 1000 inputs (attached file) through it and report the results back to us.  We will post the results as a side-by-side comparison table for all the tested apps.
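For anyone scripting their own run, reading the raw text file might look something like the sketch below. It assumes the file holds one input per line in UTF-8, which the post does not actually state, and `load_inputs` is just an illustrative helper name, not part of any published tool.

```python
from pathlib import Path


def load_inputs(path):
    """Return the non-empty lines of a plain-text benchmark file, in file order.

    Assumption: one test input per line, UTF-8 encoded.
    """
    lines = Path(path).read_text(encoding="utf-8").splitlines()
    # Strip surrounding whitespace and drop blank lines.
    return [line.strip() for line in lines if line.strip()]
```

Because the file is already sorted alphabetically, the list comes back in that same order; no re-sorting is done here.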

Pandorabots has offered to run automated tests of your app, if it has an API.  They have a program that can automatically feed the inputs to an app and create a report of the results.  Pandorabots can also create even larger test suites, up to tens of thousands of inputs.  Please contact info@pandorabots.com for more information.
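To give a rough idea of what feeding the inputs to an app and reporting the results involves, here is a minimal sketch. It is not Pandorabots' program: `run_benchmark`, the `assistant` callable, and the `is_acceptable` grading function are all hypothetical names, and how a real app is reached (its API) and how a response is judged correct are left as assumptions for whoever runs the test.

```python
from collections import Counter


def run_benchmark(inputs, assistant, is_acceptable):
    """Feed each input to the assistant and tally the outcomes.

    inputs        -- iterable of test input strings
    assistant     -- hypothetical callable: input text -> response text
    is_acceptable -- hypothetical grader: (input, response) -> bool
    """
    tally = Counter()
    failures = []
    for text in inputs:
        text = text.strip()
        if not text:
            continue  # ignore blank lines from the raw text file
        try:
            reply = assistant(text)
        except Exception:
            # The app raised instead of answering; count it separately.
            tally["error"] += 1
            failures.append(text)
            continue
        if is_acceptable(text, reply):
            tally["pass"] += 1
        else:
            tally["fail"] += 1
            failures.append(text)
    total = sum(tally.values())
    return {
        "total": total,
        "passed": tally["pass"],
        "failed": tally["fail"],
        "errors": tally["error"],
        "pass_rate": tally["pass"] / total if total else 0.0,
        "failures": failures,
    }
```

The grader is passed in rather than built in because deciding whether a response is accurate is the hard part of any such test; a human judge, a keyword check, or anything else could sit behind that function.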

Attached Files
