One of my interests has always been building more accurate 3D models using computer vision. In the not-too-distant past, I thought it would be cool to do, but that it wouldn't help me graduate, and I wasn't sure what I would use it for anyway. Now, I still think it probably won't help me graduate, but at least I've come up with a use for it besides the arm/mobile manipulation project.
I tend to see two camps in the affordance papers I'm reading. Some take a rich-model approach and learn to recognize a rather small set of objects; others go with no model and rely on observation. In practice, it seems like people use a little of both, or perhaps learn a rich model through observation. At any rate, people don't exclusively do one thing... we use models and we learn. The big question is how to build up the model.
It seems like one component of that model would be 3D structure. I have looked at a few "environment reconstruction from cameras" papers, and they don't look trivial to implement, and their results are a little lacking. While reading today, I remembered a relatively old algorithm that Philip and I partially implemented for a class project: voxel coloring. It's ridiculously slow and consumes a lot of memory (a paper on real-time voxel coloring improves this somewhat), but the end results are better than any stereo vision algorithm I have come across (although I haven't looked in depth at multi-view reconstruction). The big trick is that you have to segment out the object of interest, and you have to know where the cameras are with a rather high degree of accuracy. It's the kind of algorithm that's simple in theory (and in simulation) but extremely difficult to make robust in practice.
Limitations and assumptions notwithstanding, I think this might be a good approach for a robot to build up a 3D model of an object it's interested in. The robot requires mobility (or big-brother-style multiple cameras), but I think that's a reasonable requirement for any project that truly wants to learn object affordances. I envision an approach where the robot identifies an object of interest, takes a picture, then moves around it and takes pictures at, say, 10-degree intervals while tracking the object. From there, voxel coloring (or another reconstruction algorithm) can be applied to build a 3D model.
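To make the idea concrete, here's a rough sketch of the per-voxel consistency test at the heart of voxel coloring, assuming calibrated cameras and pre-segmented foregrounds. The Camera interface and the variance threshold are made up for illustration, and the real algorithm also sweeps voxels in a layered visibility order so occluded pixels aren't reused; this is just the core test, not a full implementation.

```cpp
// Sketch of the per-voxel color-consistency test used in voxel coloring.
// Occlusion handling (visiting voxels in layered visibility order and
// marking pixels as "claimed") is omitted for brevity.
#include <vector>

struct Vec3  { double x, y, z; };
struct Color { double r, g, b; };

struct Camera {
    // Hypothetical interface: project a world point to pixel coordinates,
    // report whether the pixel is on the segmented foreground, and sample it.
    virtual bool project(const Vec3& p, int& u, int& v) const = 0;
    virtual bool isForeground(int u, int v) const = 0;
    virtual Color sample(int u, int v) const = 0;
    virtual ~Camera() = default;
};

// A voxel is kept (and colored) if its projections agree in color across views.
bool voxelIsConsistent(const Vec3& center,
                       const std::vector<const Camera*>& cameras,
                       double varianceThreshold,
                       Color& outColor) {
    std::vector<Color> samples;
    for (const Camera* cam : cameras) {
        int u, v;
        if (!cam->project(center, u, v)) continue;   // outside this image
        if (!cam->isForeground(u, v)) return false;  // carved by the silhouette
        samples.push_back(cam->sample(u, v));
    }
    if (samples.size() < 2) return false;            // not enough evidence to keep

    Color mean{0, 0, 0};
    for (const Color& c : samples) { mean.r += c.r; mean.g += c.g; mean.b += c.b; }
    mean.r /= samples.size(); mean.g /= samples.size(); mean.b /= samples.size();

    double variance = 0;
    for (const Color& c : samples) {
        variance += (c.r - mean.r) * (c.r - mean.r)
                  + (c.g - mean.g) * (c.g - mean.g)
                  + (c.b - mean.b) * (c.b - mean.b);
    }
    variance /= samples.size();

    if (variance > varianceThreshold) return false;  // inconsistent: carve the voxel
    outColor = mean;                                 // consistent: keep and color it
    return true;
}
```

A voxel that projects onto inconsistent colors gets carved away; what's left after the sweep is a colored volumetric model of the segmented object.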
I actually had this idea before we got the SwissRanger in our lab, as a way to build up a good 3D model with only a single monocular camera. But, as I said before, implementing it probably won't help me graduate, so whether I actually do is still undecided.
Tuesday, November 17, 2009
Thursday, November 12, 2009
Inverse affordances
While thinking of hand manipulation affordances, an idea came to mind. I'll call it inverse affordances for now. The idea is something like this: the user wants to perform such and such an action on an object, so what tool or procedure would be useful for doing such an action? An example case would be unscrewing a bolt, so what size wrench (8mm? 12mm?) would be best for that task? Or a screwdriver... what size screwdriver do I need? I can see this being especially useful for telemanipulation, even something like space maintenance/construction. Hey, I'd even like to pull out a cameraphone, point it at whatever bolt I'm trying to unscrew, and have it tell me what size wrench I need. From that standpoint it could be a "mechanic's assistant."
Now, I'm not going to try to solve this problem from start to finish right now, but it might be interesting to set up a sort of Wizard-of-Oz study, where all of the objects are annotated manually ahead of time and the user then interacts with the system.
Thursday, October 1, 2009
Affordance Maps
We have come up with an overarching theme for my past, present, and future research: affordance maps. In terms of interacting with the physical world, objects afford certain actions. A common example is that a chair affords sitting. In computer vision, classifying a chair is a difficult problem, because chairs have highly varied form. For any object that affords sitting, saying that the object is a chair is quite likely not completely wrong, even if it isn't the best answer. For example, a table affords sitting, so it could be classified as a chair. Such an answer might get a chuckle out of another person, but only because it's a somewhat unusual answer, and not entirely wrong.
In this sense we can see that most objects afford many actions, and some actions are afforded by several things. A chair affords sitting, standing, pushing, pounding, kicking, etc. Throwing is afforded by almost all objects, at least for some person or machine (most people cannot throw a car, but a construction crane could). Any object will have an affordance ranking or preference pattern. For a cup, this might be drinking, pouring, drumming, trapping insects, in order from highest to lowest preference.
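Just to pin the idea down, a preference pattern could be stored as simply as a ranked list of actions per object. This is only a sketch; the names and weights below are invented for illustration.

```cpp
// Minimal sketch of an affordance "preference pattern": each object maps to
// actions ranked by preference. Names and weights are illustrative only.
#include <algorithm>
#include <map>
#include <string>
#include <vector>

struct Affordance {
    std::string action;
    double preference;   // higher = more preferred
};

using AffordanceMap = std::map<std::string, std::vector<Affordance>>;

std::vector<Affordance> rankedAffordances(const AffordanceMap& map,
                                          const std::string& object) {
    auto it = map.find(object);
    if (it == map.end()) return {};
    std::vector<Affordance> ranked = it->second;
    std::sort(ranked.begin(), ranked.end(),
              [](const Affordance& a, const Affordance& b) {
                  return a.preference > b.preference;
              });
    return ranked;
}

// Example: the cup from the text.
// AffordanceMap m;
// m["cup"] = {{"drinking", 0.9}, {"pouring", 0.7},
//             {"drumming", 0.3}, {"trapping insects", 0.1}};
// auto best = rankedAffordances(m, "cup").front();   // -> "drinking"
```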
Extending this idea, we can consider the notion of social affordances. That is, given the current state of the world, what social actions are permitted or acceptable? Discovering this kind of thing automatically is certainly daunting, and the only idea I have so far is to classify human social actions, then study and learn from humans interacting with each other and with the robot.
Thursday, August 27, 2009
Head tracking design changes
I spent some time the past couple of days refining the head tracking manipulation interface.
The first change was to mount the Wii remote up on the wall so it's farther away from the operator.
This change was motivated by the operator easily moving out of the camera's field of view when leaning left and right to change the view.
Because the camera now points down at a steep angle, one side effect of this change is that "leaning closer in" adjusts the declination of the virtual camera instead of the zoom distance.
At first I thought I should modify the trigonometry that calculates the head position, but I decided to test it as-is first.
The result: I think I like using "lean in" to adjust the declination.
My explanation is that adjusting zoom is rarely needed for this task, while adjusting declination is needed much more often.
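Roughly, the mapping now looks something like the sketch below. The tracker output, rest depth, and gains are all assumptions for illustration, not my actual numbers.

```cpp
// Sketch: map tracked head position to virtual-camera declination instead of
// zoom distance. The tracker interface and all constants are assumed.
#include <algorithm>

struct HeadPose { double x; double y; double depth; };   // from the head tracker

struct VirtualCamera {
    double azimuthDeg;       // around the arm base
    double declinationDeg;   // 0 = side view, 90 = top-down
    double zoom;             // held fixed in this mapping
};

void updateCameraFromHead(const HeadPose& head, VirtualCamera& cam) {
    const double restingDepth    = 0.8;    // meters from the Wii remote at rest (assumed)
    const double restingDecl     = 30.0;   // declination at the resting pose (assumed)
    const double declinationGain = 120.0;  // degrees per meter of "lean in" (assumed)
    const double azimuthGain     = 90.0;   // degrees per meter of sideways lean (assumed)

    // Leaning closer in raises the viewpoint toward top-down instead of zooming.
    double lean = restingDepth - head.depth;
    cam.declinationDeg = std::clamp(restingDecl + declinationGain * lean, 0.0, 90.0);

    // Leaning left/right still swings the view around the arm.
    cam.azimuthDeg = azimuthGain * head.x;
}
```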
Another change I made was to couple the virtual camera azimuth with the base joint rotation of the arm.
This means that the operator can sit still and rotate the arm, and the view keeps the arm in the same orientation by rotating the virtual camera.
The head tracking comes into play by offsetting from the coupled view.
This essentially means that only relatively small head motions are required to get the most useful viewpoints (top-down and side views).
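In sketch form, the coupling is just an addition, with head tracking providing a bounded offset. The angle convention and the offset limit here are assumptions.

```cpp
// Sketch: couple the virtual camera's azimuth to the arm's base joint, with
// head tracking supplying only an offset from that coupled view.
#include <algorithm>

double coupledAzimuthDeg(double baseJointDeg, double headOffsetDeg) {
    // The arm keeps a fixed apparent orientation as its base rotates, and
    // small head motions pick out top-down or side viewpoints around it.
    const double maxOffsetDeg = 60.0;   // assumed limit on the head-driven offset
    double offset = std::clamp(headOffsetDeg, -maxOffsetDeg, maxOffsetDeg);
    return baseJointDeg + offset;
}
```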
Speaking of viewpoints reminded me of a configuration I should probably compare against.
Several people have asked whether I have tried displaying a side and top-down view at the same time.
I think it might be fast and usable in clean environments where you can isolate the object of interest, but in cluttered environments such a display would be impossible to use without an additional dynamic "3/4" view to understand which blob corresponds to which object.
It's probably something worth including for the journal paper I plan on writing.
Labels:
design,
head tracking,
journal,
user interface
Wednesday, August 26, 2009
Erroneous artifacts in 3D display
Having recently finished annotating all of the video for the second user study, I noticed a few issues.
None of the issues I saw in the second user study were as severe as in the first, but they are interesting.
Two people, who in my subjective judgment were complete novices at robot control and at interpreting 3D information on a 2D display, mistook some artifacts in the 3D model for the deposit box, and so repeatedly dropped blocks on the artifacts.
The artifacts were showing because of an imperfect filter that is supposed to remove all parts of the 3D model that are not relevant to the task (that is, the blocks, pipes, and deposit box).
Some of the floor was showing up in the model, and these two subjects seemed to think it looked like the box.
If I were designing an interface to cater specifically to this task, I can think of ways to support the operator that would really make the deposit box stand out.
I don't think that's really what I'm researching, though, so I'm not going to change the interface design in that way, especially since 30 out of 32 people had no trouble finding the deposit box.
Most likely nobody will want to use a mobile manipulator for this particular task, since it's really mostly a toy world.
Another problem is a remnant from the first user study.
Yes, we're coming back to the alignment issue.
For one target block in one particular layout, the alignment was off by enough that at least half of the people had trouble getting it.
There were a couple other blocks that were slightly off, but most people got them as long as they followed the instructions in the training.
What it amounts to is that the calibration was slightly off for those couple of regions.
It's a little disappointing, but not too much so, since I can filter out the problem blocks to look at what happens with the well-aligned blocks.
Since there are 6 layouts and 3 blocks per layout, that means only 1 or 2 block samples out of 18 are bad.
I think it's still plenty usable and will give some interesting insights.
Labels:
calibration,
user interface,
user study
Tuesday, January 27, 2009
Novices testing interfaces that only experts will use
One of the sad ironies of my research (at least at this point) is that experts are an expensive, limited resource, while novices are readily available. So, which would you pick? We have a few strategies in mind to get something beyond purely novice results, but the results that will be viewed as statistically significant will come from novices.
The sadness for me in this case comes from the view-dependent control I talked about in my previous post. I mentioned a hybrid joint-control/end-effector control that is not view-dependent, since the view-dependence seemed confusing to most people. People who had a bit more experience working in a 3D world on a 2D screen seemed to like it more, but that's just anecdotal, and I might just be hoping. I certainly like it better. It's my impression that people would perform better with the view-dependent control after a bit of training time (a couple of hours, not a couple of minutes).
Now, I'm talking about the control type I used for the first user study, where two separate joysticks are used. The single-joystick view-dependent control was just plain confusing with head tracking. With two sticks, one is view-dependent, and the other is always up/down. That's my favorite configuration so far, but I understand the system pretty well, and I'm designing to my preferences.
The answers to these questions can only be found through science and testing. I really need to keep moving!
Labels:
control,
design,
head tracking,
user interface,
user study
Monday, January 26, 2009
This robot is out of control!
One of the things we robo-manipulation guys have to deal with is how to design the user control for the robot arm. I'm not talking about the whole user interface, but in particular the controls to make the robot move. If there's some autonomy in there, you have a little bit more flexibility, and you can fairly effectively just use a mouse. But when you just want to allow the operator to teleoperate or "remote control" the arm, things are a little trickier.
Now, just driving a wheeled robot around is not so challenging, because people are accustomed to driving cars, and sometimes even from a remote perspective. With a robot arm, you have to make a control that operates in full 3D, and that can even include 3 axes of rotation. If you want to go really low level, then you need some way to control each joint of the arm.
My first attempt at it was to use the two "thumbsticks" on a modern video game controller. The type of control I'm going for is end-effector control, where you move a virtual target point for the robot's gripper to reach. One thumbstick controls motion in one plane, and the other stick controls motion along the remaining axis. This works OK with some training, but a lot of people still seemed to struggle with it. The next attempt was to reduce the control to one stick and change the way the controls work depending on the view. From a top-down view, the end effector moves in the XY plane (where Z is vertical), and from a side view, the end effector moves in a plane parallel to the Z axis. The tricky part is views that are neither side nor top views. When the view is halfway between those, how should the end effector move? For now, the control somewhat "remembers" which mode it's in, and you have to go almost all the way to the other view to switch modes. This ends up being rather confusing even for me after practicing for a while. Another option would be to simply move the end effector in the view plane; it's hard to say whether that would be good.
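The "remember which mode it's in" behavior is essentially hysteresis on the view's declination. Here's a sketch; the thresholds are illustrative rather than my actual values.

```cpp
// Sketch of the single-stick mode "memory" as hysteresis on camera declination:
// the control stays in its current mode until the view swings almost all the
// way to the other extreme.
enum class StickMode { HorizontalXY, VerticalPlane };

StickMode updateStickMode(StickMode current, double declinationDeg) {
    const double enterTopDown = 75.0;   // must get nearly top-down to switch (assumed)
    const double enterSide    = 15.0;   // must get nearly side-on to switch back (assumed)
    if (current == StickMode::VerticalPlane && declinationDeg > enterTopDown)
        return StickMode::HorizontalXY;     // stick now moves the target in XY
    if (current == StickMode::HorizontalXY && declinationDeg < enterSide)
        return StickMode::VerticalPlane;    // stick now moves the target vertically
    return current;                         // otherwise keep the remembered mode
}
```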
Another idea would be a non-standard end-effector control approach. For instance, left and right on the stick could rotate the base of the arm, and forward and back would extend the arm. Instead of controlling individual joints, however, this mode would still be moving the end effector. There is a difference.
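In sketch form, that mapping is end-effector control in cylindrical coordinates around the base. The gains and the minimum radius here are invented.

```cpp
// Sketch of the non-standard mapping: the stick rotates the target around the
// arm's base and extends it radially, so it is still end-effector control
// (in cylindrical coordinates), not joint control.
#include <cmath>

struct Target { double x, y, z; };   // end-effector goal in the arm frame

void applyCylindricalStick(Target& t, double stickX, double stickY, double dt) {
    const double rotateRate = 0.8;   // rad/s at full left/right deflection (assumed)
    const double extendRate = 0.15;  // m/s at full forward/back deflection (assumed)

    double radius = std::hypot(t.x, t.y);
    double angle  = std::atan2(t.y, t.x);

    angle  += stickX * rotateRate * dt;   // left/right swings the base direction
    radius += stickY * extendRate * dt;   // forward/back extends or retracts
    if (radius < 0.05) radius = 0.05;     // keep the target off the base (assumed limit)

    t.x = radius * std::cos(angle);
    t.y = radius * std::sin(angle);
    // t.z unchanged; vertical motion handled elsewhere
}
```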
The other day one of the people on my committee offered his Novint Falcon (3D force feedback controller) for controlling the robot arm. At first I thought this was totally the way to go and would solve all of my problems including world hunger. This would mean that I wouldn't have to use two separate joysticks or different modes... with a 3D controller up is up, left is left, and back is back.
Then I remembered the whole view-dependent thing. The trouble is that the interface has head tracking to adjust the view. So it's really easy to adjust the view. It's possibly good and important, but it also makes one wonder whether the controls should always do the same thing, or depend on the view. There's a paper by Jose Macedo called "The Effect of Automated Compensation for Incongruent Axes on Teleoperator Performance" that talks about this, and they basically say that people do better with the automatic compensation (or as I say, view-dependent) control than without.
I think my situation is a little different from theirs, though. They evaluate 2D control, and it's also static: once the control axes and display axes are determined, they remain in their particular (mis)alignment for the duration of the experiment. In my case, it's 3D control, and the alignment between axes changes dynamically throughout the experiment. So I think it needs to be tested. Perhaps after I get this thesis done.
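For reference, the "automated compensation" (what I call view-dependent control) in my case amounts to rotating the stick vector by the camera's azimuth, so that "up" on the stick always moves the target away from the viewer. A minimal sketch, with frames and signs assumed:

```cpp
// Sketch of view-dependent (automatically compensated) control: rotate the
// stick vector by the current camera azimuth so the control axes track the
// display axes. Frame conventions and signs are assumptions.
#include <cmath>

void stickToWorldXY(double stickX, double stickY, double cameraAzimuthRad,
                    double& dx, double& dy) {
    // Standard 2D rotation of the control axes into the world frame.
    dx = stickX * std::cos(cameraAzimuthRad) - stickY * std::sin(cameraAzimuthRad);
    dy = stickX * std::sin(cameraAzimuthRad) + stickY * std::cos(cameraAzimuthRad);
}
```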
Tuesday, January 13, 2009
Filtering the 3D scan
For the 3D scan display, we don't necessarily want to show every point the camera sees; the robot arm itself, for example. We already display a graphical version of the arm that we draw based on outputs from the arm. When the arm and the 3D points overlap, it's impossible to see exactly what's going on.
We can also remove things like the table the whole setup sits on and replace it with a plane. This will be easier to see and give more contrast to items of interest, such as the blocks or pipe cleaners we will be using. The ground plane is simple enough to filter: since the program works in the robot arm's coordinate system, anything at or below zero altitude is simply eliminated.
In the case of the robot arm, though, it's a little bit trickier. There are two ways I can think of to filter out the points. The first would be to put bounding boxes around the larger parts of the arm, then move the bounding boxes along with the arm and test to see if points are inside. The second method is to use OpenGL to render my relatively detailed model of the robot arm from the viewpoint of the stereo camera and in that way determine which pixels correspond to the robot arm. We have a mapping between 3D points and image pixels, so that's not the hard part. The second way is a little bit more nebulous to me... I'm not sure how difficult it will be. I do think the second way is better, though. Certainly faster, and probably more accurate as well. We can get a "closer trim" with the second method.
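Here's a sketch of the two simpler filters described above (ground plane plus boxes tracked along with the arm). The types are invented, and the OpenGL render-and-mask variant isn't shown.

```cpp
// Sketch of the simpler filters: drop points at or below the ground plane, and
// drop points inside axis-aligned boxes placed around the arm links.
#include <vector>

struct Point3 { float x, y, z; };

struct Box {               // axis-aligned box in the arm's coordinate frame,
    Point3 min, max;       // recomputed each frame from the arm's joint angles
};

bool insideBox(const Point3& p, const Box& b) {
    return p.x >= b.min.x && p.x <= b.max.x &&
           p.y >= b.min.y && p.y <= b.max.y &&
           p.z >= b.min.z && p.z <= b.max.z;
}

std::vector<Point3> filterScan(const std::vector<Point3>& scan,
                               const std::vector<Box>& armBoxes) {
    std::vector<Point3> kept;
    kept.reserve(scan.size());
    for (const Point3& p : scan) {
        if (p.z <= 0.0f) continue;                 // table/ground plane: drop it
        bool onArm = false;
        for (const Box& b : armBoxes)
            if (insideBox(p, b)) { onArm = true; break; }
        if (!onArm) kept.push_back(p);             // keep only task-relevant points
    }
    return kept;
}
```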
Dirty 3D rectangles!
It would be cool (and possibly useful) for the 3D scan in my interface to be dynamically updated in real time. This is no small or trivial task, because the model is based on a 640x480 image, so about 300k points. In the camera API, these points are stored in a big image where each point takes up 16 bytes (int, float, float, float). That's nearly 5 megabytes of data to sift through every camera update. And this camera can update at 30 fps.
That's alotta data.
Well, it's a lot for stuff that needs to be processed in software, anyway. The bottleneck is not in transferring the data, mind you; it's in loading the data onto the video card. Copying 5 megabytes to the video card 30 times per second, or even 5 times per second, simply won't happen today.

So how do we solve this? I've been processing this at about 0.5% brainpower for the past several months, and today it dawned on me. I had been thinking all along of using vertex arrays, but I couldn't see a way to update only the vertices that have changed. It didn't really seem possible. The answer is to use vertex arrays, but lots of vertex arrays instead of one. You break the image up into smaller pieces, so it's really a 40x30 grid of sub-images that are each 16x16 pixels in size. Then you create a vertex array (or better, a vertex buffer object, or VBO) for each sub-image. Then, if nothing is happening in a part of the image, you don't bother uploading that portion of the data to the card.
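Here's a sketch of that scheme, assuming a GLEW-style context and one VBO per 16x16 tile. The dirty test and its threshold are illustrative, not my actual code.

```cpp
// Sketch of the dirty-tile idea: one VBO per 16x16 tile of the 640x480 scan,
// and only tiles whose points actually changed get re-uploaded each frame.
// Assumes an OpenGL context with buffer-object entry points loaded (e.g. GLEW)
// and a per-frame array of XYZ floats in row-major image order.
#include <GL/glew.h>
#include <cmath>
#include <vector>

constexpr int kWidth = 640, kHeight = 480, kTile = 16;
constexpr int kTilesX = kWidth / kTile, kTilesY = kHeight / kTile;   // 40 x 30 tiles

struct TileCache {
    std::vector<GLuint> vbos = std::vector<GLuint>(kTilesX * kTilesY, 0);
    std::vector<float>  last = std::vector<float>(kWidth * kHeight * 3, 0.0f);
};

// points: kWidth*kHeight*3 floats (x,y,z per pixel). Returns tiles uploaded.
int uploadDirtyTiles(TileCache& cache, const float* points, float threshold) {
    int uploaded = 0;
    std::vector<float> tile(kTile * kTile * 3);
    for (int ty = 0; ty < kTilesY; ++ty) {
        for (int tx = 0; tx < kTilesX; ++tx) {
            // Copy this tile's points and check whether any moved noticeably.
            bool dirty = false;
            for (int row = 0; row < kTile; ++row) {
                int src = ((ty * kTile + row) * kWidth + tx * kTile) * 3;
                for (int i = 0; i < kTile * 3; ++i) {
                    tile[row * kTile * 3 + i] = points[src + i];
                    if (std::fabs(points[src + i] - cache.last[src + i]) > threshold)
                        dirty = true;
                }
            }
            GLuint& vbo = cache.vbos[ty * kTilesX + tx];
            bool firstUpload = (vbo == 0);
            if (!firstUpload && !dirty) continue;   // clean tile: skip the upload
            // Remember what was uploaded so later dirty checks compare against it.
            for (int row = 0; row < kTile; ++row) {
                int src = ((ty * kTile + row) * kWidth + tx * kTile) * 3;
                for (int i = 0; i < kTile * 3; ++i)
                    cache.last[src + i] = tile[row * kTile * 3 + i];
            }
            if (firstUpload) {                      // first frame: allocate the VBO
                glGenBuffers(1, &vbo);
                glBindBuffer(GL_ARRAY_BUFFER, vbo);
                glBufferData(GL_ARRAY_BUFFER, tile.size() * sizeof(float),
                             tile.data(), GL_DYNAMIC_DRAW);
            } else {                                // later frames: re-upload only dirty tiles
                glBindBuffer(GL_ARRAY_BUFFER, vbo);
                glBufferSubData(GL_ARRAY_BUFFER, 0, tile.size() * sizeof(float),
                                tile.data());
            }
            ++uploaded;
        }
    }
    return uploaded;
}
```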
This solution means that if the whole image changes, you have to upload the whole thing all over again, but for a camera that's sitting still, that should only happen if someone trips over the tripod, in which case you don't really care about the image anymore. If the camera is mounted on a mobile robot, the 3D display will only be dynamic when the robot is sitting still and looking at a static scene. For today's tasks and technology, I think these are reasonable constraints.
Now that I've said all of this, I may or may not actually implement it. That all depends on how many grandkids I want to have by the time I graduate.
Monday, January 12, 2009
Second experiment

Because the results from my first study weren't jump-out-of-your-seat exciting and significant, we're working toward a second study to get at least a shift-weight-in-chair result.
The most important thing to fix is the alignment between the 3D scan and the robot arm. Just last week I managed to fully incorporate the new Videre/SRI STOC camera, including calibration, and the alignment is much better than with the SwissRanger. Still, something's slightly off, so I might just cheat a little. I'm thinking of artificially inserting points into the scan where I know the objects to be. That way, the critical points will be aligned perfectly, and I won't have to spend weeks developing methods to get a better alignment. It's not a perfect solution, though, because the robot arm model could still be off, but it does remove one source of error.
Another thing that needs doing is smooth arm control. Currently the arm sort of jerks around, moving in a "vroom, screech" fashion. Today I worked up a solution that works well for the most part, but it feels hackish, and I've noticed a problem that I think is due to noisy readings from my servo position feedback; I never saw it with the jerky control. So of course my adviser comments that I should be using a PD controller for this. But today I equipped my t-shirt of minimum coding effort, so I followed the path I trod earlier to get this thing whipped up quick. And now that I've got a solution mostly working, I know that my adviser is right, and I'll probably have weird problems crop up, or a lot of work to remove the quirks, unless I make a PD controller (sigh, grumble, sigh). It's sad to throw away code, but sometimes it just has to be done.
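For the record, the PD controller my adviser is suggesting would amount to something like this per-joint sketch. The gains and the servo interface are assumptions, and noisy position feedback would probably also need some filtering on the derivative term.

```cpp
// Sketch of a per-joint PD controller for smooth arm motion: command each
// joint's velocity from its position error and the error's derivative.
struct PDGains { double kp, kd; };

class JointPD {
public:
    explicit JointPD(PDGains g) : gains_(g) {}

    // target and measured in radians; dt in seconds; returns a velocity command.
    double update(double target, double measured, double dt) {
        double error  = target - measured;
        double dError = (error - prevError_) / dt;   // noisy feedback may need filtering
        prevError_ = error;
        return gains_.kp * error + gains_.kd * dError;
    }

private:
    PDGains gains_;
    double prevError_ = 0.0;
};
```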
That's really all of the hard stuff... there are lots of little things, like improving the subjective user surveys (Likert scale and all of that), but the big stuff is getting to be under control. I can definitely see the study starting up again within a couple weeks (knock knock knock).