How to find the location of text inside an image?

Back when I was very new to automation and to Appium, I ran into an interesting problem: “How do I click on the ‘Help’ link in an Android app?”

While the problem sounds easy, a link embedded in a TextView in Android doesn’t get an identifier of its own: it is just part of the view’s text. Here’s an example of what the implementation code looks like:

TextView link = (TextView) findViewById(R.id.hyper);
String linkText = "<a href='http://example.com/help'>Click here</a> in case of fire";
link.setText(Html.fromHtml(linkText));
link.setMovementMethod(LinkMovementMethod.getInstance());

The UiAutomatorViewer treats it as normal text. Nowadays, with Espresso, it’s relatively easy to find the location of the text and then perform a touch operation, and your task is done.

However, there is no way to do that when using the UiAutomator mode in Appium. So our solution is:

Take a screenshot of the screen where the desired text is present.
Read all the texts in the screenshot.
Find the coordinates of the desired text.
Perform a touch operation (and hope for the profit :) ).

Now that we have a plan to solve our problem, let’s follow it.

1. Take the screenshot

Take the screenshot using whichever programming language you are working in; for me it was Python.

`driver.save_screenshot(destination_file_path)`

2. Read all the text in the screenshot image

Here things get more interesting. The action we are performing is outside the scope of Appium, so we have to think outside the box. Think OCR: Tesseract is one of the best open-source OCR engines available.

So let’s read the text with Tesseract. But wait, there’s a problem: when using a coloured image, Tesseract can’t read all the text. To fix that, we’ll load the image as grayscale and then threshold it to pure black and white. That gives Tesseract cleaner image data, which it can process with a much better success rate.

Here’s what the full implementation looks like:

import pytesseract
from pytesseract import Output
import cv2

text_pos_list = {}

# The colored image doesn't produce a good OCR result
# img = cv2.imread('image.png')


# Load the image as grayscale, then binarize it. With THRESH_OTSU the
# threshold is picked automatically and the 128 passed here is ignored.
img = cv2.imread("image.png", cv2.IMREAD_GRAYSCALE)
(thresh, img_bw) = cv2.threshold(img, 128, 255, cv2.THRESH_BINARY | cv2.THRESH_OTSU)
# Invert the colors; this helps when the text is light on a dark background.
# In fact, reading the text from both the normal and the inverted image gives a better OCR result.
img = cv2.bitwise_not(img_bw)

# Read the text, and also get the position and size of each piece of text
d = pytesseract.image_to_data(img, output_type=Output.DICT)
# Number of boxes Tesseract returned
n_boxes = len(d['level'])

# Map each non-empty piece of text to its bounding box
for i in range(n_boxes):
    text = d['text'][i].strip()
    if len(text) == 0:
        continue
    (x, y, w, h) = (d['left'][i], d['top'][i], d['width'][i], d['height'][i])
    text_pos_list[text] = (x, y, w, h)
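One caveat with a plain dictionary keyed by text: if the same word appears twice on the screen, the later box silently replaces the earlier one, and low-confidence OCR noise gets stored alongside good reads. Here is a minimal sketch of one way around that, assuming `data` has the same shape as the dictionary returned by `image_to_data`. The helper name `collect_word_boxes`, the `min_conf` cut-off of 60 and the sample values are all made up for illustration; they are not part of pytesseract.

```python
from collections import defaultdict


def collect_word_boxes(data, min_conf=60.0):
    """Group OCR boxes by word, keeping every occurrence.

    `data` is assumed to look like the dict returned by
    pytesseract.image_to_data(..., output_type=Output.DICT).
    `min_conf` is an arbitrary cut-off chosen for this sketch.
    """
    boxes = defaultdict(list)
    for i, raw_text in enumerate(data["text"]):
        text = raw_text.strip()
        if not text:
            continue
        # Depending on the pytesseract version, "conf" may hold strings or numbers.
        if float(data["conf"][i]) < min_conf:
            continue
        boxes[text].append(
            (data["left"][i], data["top"][i], data["width"][i], data["height"][i])
        )
    return dict(boxes)


# Hand-written sample in the same shape, so the sketch runs without Tesseract.
sample = {
    "text": ["", "Help", "Help", "He1p"],
    "conf": ["-1", "96", 91.5, "12"],
    "left": [0, 40, 40, 300],
    "top": [0, 100, 900, 500],
    "width": [1080, 90, 90, 80],
    "height": [1920, 40, 40, 40],
}
word_boxes = collect_word_boxes(sample)
print(word_boxes)
```

With the sample above, both occurrences of "Help" are kept and the low-confidence "He1p" is dropped, so the caller can decide which occurrence to tap (for example the top-most one).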

3. Find the coordinates of the desired text.

This is where you look up the coordinates of the desired text; with those in hand you can perform the action and you are done. For debugging purposes, however, you can draw a rectangle around each piece of detected text:

# Convert back to a 3-channel image so the green rectangles show up in colour
preview = cv2.cvtColor(img, cv2.COLOR_GRAY2BGR)
for i in range(n_boxes):
    (x, y, w, h) = (d['left'][i], d['top'][i], d['width'][i], d['height'][i])
    cv2.rectangle(preview, (x, y), (x + w, y + h), (0, 255, 0), 2)

# And plot the image for an instant preview
cv2.imshow('img', preview)
# Press any key to dismiss the preview window
cv2.waitKey(0)
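A link such as "Click here" spans two words, and Tesseract reports each word separately, so a single-word lookup will not find it. Below is a rough sketch of a phrase lookup, assuming `data` has the shape of the `image_to_data` dictionary, including its `block_num`, `par_num` and `line_num` keys. The function name `find_phrase_box` and the sample values are hypothetical, and the match is a simple case-insensitive comparison that does not account for punctuation stuck to a word.

```python
def find_phrase_box(data, phrase):
    """Return (x, y, w, h) covering the first occurrence of `phrase`, or None.

    `data` is assumed to look like the dict from pytesseract.image_to_data.
    Words only match when they are consecutive and on the same OCR line.
    """
    wanted = phrase.lower().split()
    if not wanted:
        return None

    # Collect the non-empty words with their line key and bounding box.
    words = []
    for i, raw in enumerate(data["text"]):
        text = raw.strip()
        if text:
            line = (data["block_num"][i], data["par_num"][i], data["line_num"][i])
            box = (data["left"][i], data["top"][i], data["width"][i], data["height"][i])
            words.append((text.lower(), line, box))

    # Slide a window of len(wanted) words over the OCR output.
    for start in range(len(words) - len(wanted) + 1):
        window = words[start:start + len(wanted)]
        same_line = all(w[1] == window[0][1] for w in window)
        if same_line and [w[0] for w in window] == wanted:
            left = min(b[0] for _, _, b in window)
            top = min(b[1] for _, _, b in window)
            right = max(b[0] + b[2] for _, _, b in window)
            bottom = max(b[1] + b[3] for _, _, b in window)
            return (left, top, right - left, bottom - top)
    return None


# Hand-written sample for "Click here in case", so no OCR run is needed.
sample = {
    "text": ["Click", "here", "in", "case"],
    "block_num": [1, 1, 1, 1],
    "par_num": [1, 1, 1, 1],
    "line_num": [1, 1, 1, 1],
    "left": [50, 140, 220, 260],
    "top": [200, 202, 204, 202],
    "width": [80, 70, 30, 60],
    "height": [30, 28, 24, 28],
}
link_box = find_phrase_box(sample, "Click here")
print(link_box)
```

The returned box is the union of the matched word boxes, in the same `(x, y, w, h)` form used in step 2.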

4. You now have the coordinates corresponding to the text you were looking for, so you can perform a touch operation, e.g.

`TouchAction(driver).tap(None, x, y, tap_count).perform()`
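Note that the `(x, y)` stored in step 2 is the top-left corner of the bounding box, so tapping it directly may land on the very edge of the link. Tapping the centre of the box is usually safer. A small sketch, where `box_center` and the example box are made up for illustration:

```python
def box_center(box):
    """Return the integer centre point of an (x, y, w, h) box."""
    x, y, w, h = box
    return (x + w // 2, y + h // 2)


# Example box in the same (x, y, w, h) form stored in text_pos_list.
help_box = (40, 100, 90, 40)
tap_x, tap_y = box_center(help_box)
print(tap_x, tap_y)  # 85 120
```

You would then pass `tap_x` and `tap_y` to the tap call above instead of the raw `x` and `y`. Newer releases of the Appium Python client deprecate `TouchAction`; as far as I know `driver.tap([(tap_x, tap_y)])` is the equivalent there, but check this against the client version you are using.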

If you are still stuck with a project where using the Espresso mode in Appium is not possible, or you use some other automation tool that doesn’t support clicking on text, I hope this helps. You may also be able to apply or adapt this solution to some other problem.