Analyzing MobileHunter

Background

Investigative journalists from NDR, Süddeutsche Zeitung (SZ, for short), and multiple other international teams of journalists managed to get hold of a surveillance app that is used by Chinese border officials to scan mobile phones of people entering the country from Kyrgyzstan. This sparked a collaboration between the journalists, my advisor Prof. Thorsten Holz, who leads the Chair for Systems Security at Ruhr-Universität Bochum, and me to unveil the inner workings of the app and figure out what exactly the app searches for.

This post aims to shed some more light on our results.

Note: A few days prior to this publication it has come to our attention that the penetration testers at Cure53 were independently tasked with analyzing this very application. As they have a track record of excellent work, please make sure to check out their report as well.

App Structure

We were given the app’s installer for Android, which came in the form of a regular APK file.

The app itself has an interesting structure – in addition to the regular, to-be-expected Java layer, it comes with a set of binary assets, listed in the following:

| File Name | Description |
| --- | --- |
| bk_samples.bin | Encrypted file |
| gen_wifi_cj_flag[_pie] | ELF executable |
| getVirAccount | ELF executable |
| id.conf | Text file containing regular expressions |
| terrorism_apps.csv | Empty file (for our APK) |
| wifiscan[_pie] | ELF executable |

One thing worth noting is that it ships both PIE and non-PIE versions of its executables. PIE stands for position-independent executable and has been enforced on Android since version 5.0 (Lollipop) in late 2014. By shipping non-PIE binaries as well, the authors maintain compatibility with older versions of Android that do not yet support PIE (namely, any version prior to Android 4.1).
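As an aside, the two variants of a binary can be told apart from the ELF header alone: a position-independent executable carries the type ET_DYN in its e_type field, whereas a classic fixed-address executable carries ET_EXEC. The following is a minimal sketch of such a check; the class and method names are our own and do not appear in the app. Note that shared libraries are also ET_DYN, so the check is only meaningful for files already known to be executables.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

final class ElfKind {
    static final int ET_EXEC = 2; // fixed-address executable (non-PIE)
    static final int ET_DYN = 3;  // shared object or position-independent executable

    /** Returns true if the given ELF header bytes declare the type ET_DYN. */
    static boolean isPie(byte[] header) {
        if (header.length < 18 || header[0] != 0x7f || header[1] != 'E'
                || header[2] != 'L' || header[3] != 'F') {
            throw new IllegalArgumentException("not an ELF header");
        }
        // e_ident[EI_DATA] at offset 5: 1 = little endian, 2 = big endian
        ByteOrder order = header[5] == 2 ? ByteOrder.BIG_ENDIAN : ByteOrder.LITTLE_ENDIAN;
        // e_type is the 16-bit field that directly follows the 16-byte e_ident array
        int type = ByteBuffer.wrap(header).order(order).getShort(16) & 0xffff;
        return type == ET_DYN;
    }
}
```

Feeding it the first 18 bytes of wifiscan_pie and wifiscan should, under this reading, yield true and false respectively.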

As we shall see later, the binaries are one of the more interesting components of this app. We will revisit these in a bit, but let’s first get a bird’s-eye view on how the app operates.

Dynamic Analysis

For our initial analysis, we used a Huawei P10 running Android 7.0 and installed the APK. The phone contains a few contacts and only comes with the pre-installed apps.

In order to prevent any outbound connections, we first ran the app in a shielding box and activated airplane mode. We were greeted with the following screen:

The main screen of the app when opening it in airplane mode.

Evidently, the app expects an active connection to a particular WiFi hotspot. Dimly in the background, one can make out that it wants to connect to a local IP in the 192.168.43.* subnet. Later analysis revealed that it attempts to connect to 192.168.43.1 on port 8080. (In more detail: it obtains the local IP and sets the last octet to .1, but explicitly checks for the 192.168.43.* subnet when deciding whether to issue this warning.)
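The address handling described in the parenthesis can be sketched in a few lines. This is an illustration of the described behavior rather than the app's decompiled code; the class, constant, and method names are hypothetical.

```java
final class HotspotAddress {
    static final String EXPECTED_PREFIX = "192.168.43."; // subnet the app checks for
    static final int PORT = 8080;                         // port the report is sent to

    /** Replaces the last octet of a dotted IPv4 address with 1. */
    static String gatewayFor(String localIp) {
        int lastDot = localIp.lastIndexOf('.');
        if (lastDot < 0) {
            throw new IllegalArgumentException("not a dotted IPv4 address: " + localIp);
        }
        return localIp.substring(0, lastDot + 1) + "1";
    }

    /** Decides whether the "wrong hotspot" warning would be suppressed. */
    static boolean onExpectedSubnet(String localIp) {
        return localIp.startsWith(EXPECTED_PREFIX);
    }
}
```

Under this sketch, a phone at 10.0.0.5 would still derive 10.0.0.1 as the upload target, but the warning would be shown because the address is outside the expected subnet.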

Luckily, the app does not seem to connect to any external servers, which eased our analysis.

Checking the Phone

Conceptually, the app is rather simple: one button offers us to start checking our phone and another one allows us to uninstall the app after it has performed its task. In part, this super simple setup might be the attempt to ease the work of the border officials: given an unlocked phone, all they need to do is install the APK, connect to their WiFi hotspot and start scanning the phone with the press of a button.

Once the scan has completed, the app tries to upload a report to the local WiFi hotspot. The main screen now gives additional details, such as the number of files scanned, and lets us know that no suspicious files were “hitted” [sic]. This is no surprise since our phone is virtually empty. As we are in airplane mode, the app complains that it cannot upload the report and offers to try again. At this point, we set up a WiFi hotspot listening on the above-mentioned address and captured the resulting report, which comes in the form of a regular ZIP file.

Initial Report

Even without any notable content on our phone, the report already contains a plethora of information:

| File Name | Description |
| --- | --- |
| app_list | List of installed applications |
| AppParse.prop | Hardware information (model, CPU, board, hardware, and device) |
| Calendar.xml | List of calendar entries |
| Contact.xml | List of contacts |
| contact{n}.jpg | Contact pictures, sequentially numbered |
| Dialing.xml | List of phone calls (esp. name, number, time, and duration) |
| Messages.xml | List of SMS messages |
| phone.txt | Empty, as our phone does not have a SIM equipped |
| PhoneData.cha | General information about the device and the scan itself |
| report.html | Formatted HTML report with a subset of the data above |

In the following, we will discuss two more interesting entries, app_list and PhoneData.cha.

Contents of app_list

This file is created with the help of Android’s PackageManager API and lists for each installed application the app name, package name, version, code size, time of installation, path, and MD5 hash. The analysis is thorough enough to even include itself:

蜂采	com.fiberhome.wifiserver	installed	1.0	4041059	1561643875	/data/app/com.fiberhome.wifiserver-1/base.apk	1	null	8ddb342f2da5408402d7568af21e29f9	null
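To make the layout of such a line more tangible, the following sketch splits an entry into the fields named above. It assumes the columns are tab-separated and appear in the order of the sample line; the meaning of the remaining columns (installed, 1, and the two null values) is not known to us, so they are ignored here. The class name is our own.

```java
final class AppListEntry {
    final String appName, packageName, version, path, md5;
    final long codeSize, installTime;

    private AppListEntry(String[] f) {
        appName = f[0];
        packageName = f[1];
        // f[2] ("installed"), f[7] and f[8] are skipped: their meaning is an open question
        version = f[3];
        codeSize = Long.parseLong(f[4]);
        installTime = Long.parseLong(f[5]); // looks like a Unix timestamp in seconds
        path = f[6];
        md5 = f[9];
    }

    /** Parses one tab-separated line of app_list. */
    static AppListEntry parse(String line) {
        String[] fields = line.split("\t", -1);
        if (fields.length < 10) {
            throw new IllegalArgumentException("unexpected number of columns: " + fields.length);
        }
        return new AppListEntry(fields);
    }
}
```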

Contents of PhoneData.cha and report.html

Amongst other information, the PhoneData.cha file contains the phone’s manufacturer, model, Android version, WiFi and Bluetooth MAC addresses, IMEI, and (if a SIM card is present) its IMSI. It also performs a rudimentary root detection by checking for the presence of either /System/bin/su or /System/xbin/su.

Some of this information, along with the messages, contact, and dialing logs is duplicated in report.html.

Interestingly, PhoneData.cha also contains an entry labelled DeviceName with the value MobileHunter. In contrast, the rendered report.html is captioned CellHunter Reporter. We can assume that either of these strings indicates the original application’s name.

Static Analysis

Even though the initial report already contains a disturbingly detailed compilation of sensitive information on our phone, there is more to the story.

In order to verify the information in the report, we decompiled the Java layer of the app. Fortunately, there were no attempts made to obfuscate the code or hinder analysis in any other way. For the most part, the code is straight-forward: it uses well-known Android APIs to collect the information provided in the report. At this point, however, we were more interested in how the binary assets we discovered early on were used by the app.

Helper Binaries

Soon, we came across the following piece of code:

String string2 = WelcomeActivity.this.getResources().getString(2131165184);
if (string2.contains("true")) {
    if (Build.VERSION.SDK_INT >= 16) {
        ShellCommands.doSuCmds("sh", Global.absolutefilesPath_ + "/wifiscan_pie sm " + WelcomeActivity.this.sdP + " 2>" + Global.absolutefilesPath_ + "/error_file 1>" + Global.esnPath_ + "scandir_temp");
    } else {
        ShellCommands.doSuCmds("sh", Global.absolutefilesPath_ + "/wifiscan sm " + WelcomeActivity.this.sdP + " 2>" + Global.absolutefilesPath_ + "/error_file 1>" + Global.esnPath_ + "scandir_temp");
    }
}
if (!"true/false".equals(string2)) {
    ShellCommands.doSuCmds("sh", Global.absolutefilesPath_ + "/getVirAccount " + Global.absolutefilesPath_ + "/id.conf " + Global.esnPath_ + "app_account");
}

Based on configuration values found in its resources, the app spawns two of the binary files found in its data directory: wifiscan and getVirAccount. Notably, for the wifiscan binary, it chooses between the PIE and non-PIE variants based on Android’s SDK level, which serves as an indicator as to whether PIE is supported by the installed Android version.

The choice to directly invoke helper programs instead of using the Java Native Interface (JNI) strikes us as odd. Further, it remains unclear why native binaries are used at all. We can only suspect this might be due to performance concerns or even the hope of obscuring the logic a bit more, as native code is a bit harder to analyze than plain decompiled Java code.

During startup, the app looks up several important locations which are used when invoking the helper binaries.

In summary, wifiscan is passed the parameter sm as well as all known SD card paths. Its output is redirected to a file called scandir_temp which will ultimately be added to the report. The app invokes getVirAccount by passing the path to its configuration, id.conf, and an output path to a file named app_account that is also included in the final report.

getVirAccount

getVirAccount is a stripped 32-bit ELF executable for ARM (EABI5). Although the industry standard disassembler, IDA Pro, recognizes a bit more than 1,200 functions, the binary itself isn’t too complex. Most complexity stems from the fact that it is a C++ executable and makes heavy use of the STL.

Interestingly enough, none of the associated binary files support any other architecture than ARM. Arguably, the vast majority of Android devices ship with an ARM processor nowadays, but this app might still miss some of the more obscure devices.

The binary spends the majority of its time parsing its configuration file, id.conf. As it turns out, this file is rather self-explanatory and the binary more or less does exactly what one would expect it to do:

#包名\t路径名\t获取方式
#获取方式DIR FILE FILE_CONTENT
com.tencent.mobileqq	tencent/MobileQQ/	DIR	(^[1-9][0-9]+)
com.tencent.mobileqq	Tencent/MobileQQ/	DIR	(^[1-9][0-9]+)
com.tencent.mobileqq	tencent/QWallet/	DIR	(^[1-9][0-9]+)
com.tencent.mobileqq	Tencent/QWallet/	DIR	(^[1-9][0-9]+)
com.renren.mobile.android	Android/data/com.renren.mobile.android/cache/talk_log/	FILE	talk_log_([0-9]+)_.*
com.duowan.mobile	yymobile/logs/sdklog/	FILE_CONTENT	logs-yypush_.*txt	safeParseInt ([0-9]*)
com.immomo.momo	immomo/users/	DIR	(^[1-9][0-9]+)
cn.com.fetion	Fetion/Fetion/	DIR	(^[1-9][0-9]+)
com.alibaba.android.babylon	Android/data/com.alibaba.android.babylon/cache/dataCache/	FILE	(^[1-9][0-9]+)
#"phone":"18551411***"
com.sdu.didi.psnger	Android/data/com.sdu.didi.psnger/files/omega	FILE_CONTENT	e.cache	"phone":"([0-9]*)"
#aaaa
com.sankuai.meituan	Android/data/com.sankuai.meituan/files/elephent/im/	DIR	(^[1-9][0-9]+)
com.sogou.map.android.maps	Android/data/com.sogou.map.android.maps/cache/	FILE_CONTENT	cache	"a":"([^"]*)"
#com.sina.weibo	loginname=red***@163.com&
com.sina.weibo	sina/weibo/weibolog/	FILE_CONTENT	sinalog.*txt	loginname=([^&]*)&

Lines starting with # indicate a comment. All other lines fall into one of the following categories:

| Type | Description |
| --- | --- |
| DIR | Extract the name of the directory inside the given path. |
| FILE | Extract the name of the file inside the given path. |
| FILE_CONTENT | Extract the contents of a specific file. |

Each line provides a package name of the app in question, the path to its associated data (and, for FILE_CONTENT, a file name) as well as a regular expression that extracts a part of the name or content. Each match yields a line in the report file app_account, containing the package name followed by the extracted identifier.
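The rule format lends itself to a compact illustration. The sketch below parses a single line of id.conf and applies it to a list of names or to file contents. It is a re-implementation in Java for illustration only: the original is a C++ binary, and details such as search versus full-match semantics, reporting every match instead of only the first, and the tab between package name and identifier in the output are assumptions on our part. All names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class AccountRule {
    final String packageName;
    final String path;
    final String mode;            // DIR, FILE or FILE_CONTENT
    final Pattern namePattern;    // applied to directory or file names
    final Pattern contentPattern; // only set for FILE_CONTENT rules

    private AccountRule(String packageName, String path, String mode,
                        Pattern namePattern, Pattern contentPattern) {
        this.packageName = packageName;
        this.path = path;
        this.mode = mode;
        this.namePattern = namePattern;
        this.contentPattern = contentPattern;
    }

    /** Parses one line of id.conf; returns null for comments and malformed lines. */
    static AccountRule parse(String line) {
        if (line.isEmpty() || line.startsWith("#")) return null;
        String[] f = line.split("\t");
        if (f.length < 4) return null;
        Pattern content = f.length > 4 ? Pattern.compile(f[4]) : null;
        return new AccountRule(f[0], f[1], f[2], Pattern.compile(f[3]), content);
    }

    private String report(Matcher m) {
        // Prefer the first capture group, fall back to the whole match
        String id = m.groupCount() > 0 ? m.group(1) : m.group();
        return packageName + "\t" + id;
    }

    /** DIR and FILE rules: extract identifiers from the names found below the rule's path. */
    List<String> extractFromNames(List<String> names) {
        List<String> out = new ArrayList<>();
        for (String name : names) {
            Matcher m = namePattern.matcher(name);
            if (m.find()) out.add(report(m));
        }
        return out;
    }

    /** FILE_CONTENT rules: extract identifiers from the text of a matching file. */
    List<String> extractFromContent(String content) {
        List<String> out = new ArrayList<>();
        if (contentPattern == null) return out;
        Matcher m = contentPattern.matcher(content);
        while (m.find()) out.add(report(m));
        return out;
    }
}
```

For the first MobileQQ rule, a directory named 123456789 would be reported as an account identifier, while directories whose names do not start with a non-zero digit are ignored.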

Overall, this mechanism is used to extract account identifiers of various popular Chinese social media apps. Notably, no passwords, session tokens, or other kinds of information are extracted that would allow logging into the account. We can only speculate that account identifiers may be correlated to potentially suspicious account activity by another mechanism external to this app.

Finally, the omission of Tencent’s popular messaging app WeChat is interesting; especially since other Tencent apps are already included in the configuration.

wifiscan

The wifiscan PIE and non-PIE binaries both are, just as getVirAccount, 32-bit ELF executables for ARM EABI5. They are written in C++ as well and make use of the STL. As parameters, they accept a set of modes as well as paths that are to be scanned.

The first thing that comes to our attention is the fact that the binary supports more modes than are actually hardcoded on the Java side (s and m; presumably scan and match). In the following, we will describe the binary’s behavior for this configuration and will get back to the other modes shortly.

bk_samples.bin

After verifying its command line arguments, the binary soon starts reading in the binary file bk_samples.bin which we discovered earlier in the assets directory of the app. As evident from the code, as well as debug outputs which were helpfully left in the binary, the binary file is encrypted and needs to be decrypted first. The encryption scheme is a pretty standard AES implementation, with some minor adjustments such as obscuring the static symmetric key; presumably to make extracting it a bit harder.

What is interesting about the symmetric key is that it is twice as large as actually used by the AES implementation. Since the key is an ASCII representation of hexadecimal digits, we are led to believe that the authors originally meant to convert the key into a binary representation instead of simply throwing away the second half. This is a common mistake when handling different representations of binary data.
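The difference between the two treatments of the key is easiest to see side by side. In the following sketch, the key string is a made-up placeholder (not the app's key), and the 16-byte key length is an assumption chosen for illustration. Both methods return a key of the same length, but only the second one uses all of the digits.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

final class KeyHandling {
    /** Suspected actual behavior: the first half of the ASCII string is used as raw key bytes. */
    static byte[] truncateAscii(String hexKey) {
        byte[] ascii = hexKey.getBytes(StandardCharsets.US_ASCII);
        return Arrays.copyOf(ascii, ascii.length / 2);
    }

    /** Presumably intended behavior: every pair of hex digits is decoded into one key byte. */
    static byte[] decodeHex(String hexKey) {
        byte[] out = new byte[hexKey.length() / 2];
        for (int i = 0; i < out.length; i++) {
            out[i] = (byte) Integer.parseInt(hexKey.substring(2 * i, 2 * i + 2), 16);
        }
        return out;
    }
}
```

With a 32-digit string, both variants produce 16 bytes. In the truncated variant, however, every key byte is one of only 16 possible ASCII characters, so the key carries at most 4 bits of entropy per byte instead of 8.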

The decrypted bk_samples.bin starts as follows:

135510055
E624931E72EB7D0736B8E43BE9BBA4B6
8765440
3A78017C9F0B948EE8B99F7CD9D0A359
868352
16FB644579B95CB73B80C75C381D14AC
2029879
790F89DDD4C74C5C97F59BB32C5E64F3
5210112
B229B6C4DDB12C59E3D2F061179A1B4B
59172363
12FEBEDF9B5F31469629244DC3444F96
...

Further analysis makes it obvious that this is a database of file sizes along with the expected MD5 hash of the file. Every two lines form a database entry, a tuple (size, hash). It is used to somewhat uniquely describe a file with a size of size bytes and the MD5 checksum hash.
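A natural in-memory representation of this format is a map from file size to the set of hashes recorded for that size. The sketch below builds such a map from the decrypted lines; it is our own illustration, and we do not claim the binary uses the same data structure.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

final class SampleDatabase {
    /** Maps a file size to the set of upper-case MD5 hex digests recorded for that size. */
    static Map<Long, Set<String>> parse(List<String> lines) {
        Map<Long, Set<String>> bySize = new HashMap<>();
        // Every two consecutive lines form one (size, hash) entry
        for (int i = 0; i + 1 < lines.size(); i += 2) {
            String sizeLine = lines.get(i).trim();
            if (sizeLine.isEmpty()) break; // tolerate trailing blank lines
            long size = Long.parseLong(sizeLine);
            String md5 = lines.get(i + 1).trim().toUpperCase();
            bySize.computeIfAbsent(size, k -> new HashSet<>()).add(md5);
        }
        return bySize;
    }
}
```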

Overall, this database consists of no less than 73,315 entries. (Side note: The authors could have reduced its file size tremendously by storing it as binary data instead of plain ASCII.)

Matching Process

The database is then used as follows: for every path that was passed to the wifiscan binary, a recursive directory traversal is performed. If any file is visited which has a size that is present in the database, its MD5 checksum is computed. If the checksum matches the MD5 checksum for that particular entry, the file is considered a hit. Checking the file size before actually computing the MD5 speeds up the process considerably.
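The size-first check can be sketched as follows. For brevity, the example operates on in-memory byte arrays instead of walking a directory tree and streaming files from disk, and it expects the database as a map from size to hashes; both are simplifications of our own, as are all names.

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Map;
import java.util.Set;

final class HitMatcher {
    /** Returns the upper-case hexadecimal MD5 digest of the given data. */
    static String md5Hex(byte[] data) {
        try {
            byte[] digest = MessageDigest.getInstance("MD5").digest(data);
            StringBuilder sb = new StringBuilder();
            for (byte b : digest) sb.append(String.format("%02X", b));
            return sb.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e); // MD5 is available on every Java platform
        }
    }

    /** A file is a hit if its size is listed and its MD5 matches an entry for that size. */
    static boolean isHit(byte[] content, Map<Long, Set<String>> db) {
        Set<String> candidates = db.get((long) content.length);
        if (candidates == null) return false; // unknown size: no hash is computed at all
        return candidates.contains(md5Hex(content));
    }
}
```

Since most files on a phone will not share a size with any database entry, the comparatively expensive hash is computed only for a small fraction of them.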

For every hit, metadata is printed that is later stored in the report. Also, the main Java code issues a beep tone to indicate to the border officer that a suspicious file has been observed and the number of matches is counted in the UI:

The app's UI after discovering six blacklisted files.

For each match, the binary prints a line of metadata, which might look like the following:

3	1fa261535eb0a3ad53ab499c93a40092f919db25374d081e1aa22a703df48a50.pdf	5460831	/storage/emulated/0/1fa261535eb0a3ad53ab499c93a40092f919db25374d081e1aa22a703df48a50.pdf	B9AA0AB31F184EE23A336B4B3B804835	pdf	1561639203	1561639202

With this information, the border officer can exactly pinpoint the suspicious file on the user's phone, including how frequently it was actually used.

Deletion of pe*.apk

During the directory traversal, if the binary ever encounters a file whose file name starts with pe and ends on .apk, the file is deleted.
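Expressed as code, the file name check amounts to a simple predicate. The sketch assumes a case-sensitive comparison, which we did not verify; the class and method names are our own.

```java
final class InstallerCleanup {
    /** Returns true for file names of the form pe*.apk. */
    static boolean shouldDelete(String fileName) {
        return fileName.startsWith("pe") && fileName.endsWith(".apk");
    }
}
```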

For a regular user, it is somewhat unlikely to find APKs stored on the SD card. This leads us to believe that this could be some sort of cleanup mechanism by the app: assuming the installer for this very app was named according to the above-mentioned scheme, a simple run of the app would suffice to delete its own installer that was previously downloaded to the phone's SD card. When subsequently pressing the Uninstall button, nearly no trace of the app would be left on the user's phone.

cjlog.txt

Having performed the scan of the phone’s files, the app writes an obfuscated log file to a folder called Android on the SD card. It records whether any suspicious files have been hit as well as the timestamp of the last scan. Even after uninstalling the app, this file remains on the phone.

To prevent exposing this information directly, the file is obfuscated by generating randomness via lrand48 (seeded by the infamous time(0), a weak source of entropy), hashing it using MD5, and finally xor-combining it with the actual data. The randomness is stored alongside the obfuscated bytes.

While this scheme can be easily reversed, there actually is no need to do so: the app ships a dedicated program called gen_wifi_cj_flag that performs this task. It generates a file called cjlog_plain.txt.
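To illustrate why the obfuscation offers no real protection, the following sketch implements a scheme of the same shape: a time-seeded random value is hashed with MD5, the digest is XORed over the data, and the random value is stored next to the result. The concrete file layout (a 4-byte value in front of the payload), the repetition of the digest for longer inputs, and the use of java.util.Random as a stand-in for lrand48 are assumptions for illustration and do not reflect the exact format of cjlog.txt.

```java
import java.nio.ByteBuffer;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Arrays;
import java.util.Random;

final class CjLogObfuscation {
    private static byte[] md5(byte[] in) {
        try {
            return MessageDigest.getInstance("MD5").digest(in);
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }

    /** Stand-in for lrand48 seeded with time(0): predictable from the wall clock. */
    static int timeSeededNonce() {
        return new Random(System.currentTimeMillis() / 1000L).nextInt();
    }

    /** Assumed layout: 4 bytes of nonce followed by the XORed payload. */
    static byte[] obfuscate(byte[] plain, int nonce) {
        byte[] nonceBytes = ByteBuffer.allocate(4).putInt(nonce).array();
        byte[] pad = md5(nonceBytes);
        byte[] out = new byte[4 + plain.length];
        System.arraycopy(nonceBytes, 0, out, 0, 4);
        for (int i = 0; i < plain.length; i++) {
            out[4 + i] = (byte) (plain[i] ^ pad[i % pad.length]);
        }
        return out;
    }

    /** Everything needed to undo the XOR is stored in the file itself. */
    static byte[] deobfuscate(byte[] stored) {
        byte[] pad = md5(Arrays.copyOf(stored, 4));
        byte[] plain = new byte[stored.length - 4];
        for (int i = 0; i < plain.length; i++) {
            plain[i] = (byte) (stored[4 + i] ^ pad[i % pad.length]);
        }
        return plain;
    }
}
```

Because the random value travels with the data, anyone who knows the scheme can recover the plain text without guessing the seed.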

Unused Functionality

The wifiscan binary contains more functionality than is used by the outer Java layer; we can only speculate about the reason. For one, it has additional code that handles entries in bk_samples.bin with a size above one MiB (this holds for 47,221 entries in total; for mode D). We did not look further into this as it was not used in our configuration.

Additionally, the binary provides even more scanning modes, namely p (pictures), v (videos), and d (documents). Each mode is assigned a hardcoded list of file extensions:

| Type | Extensions |
| --- | --- |
| Pictures | .tiff, .tif, .png, .jpg, .bmp, .jpeg, .cr2, .gif |
| Videos | .3gp, .aac, .amr, .flac, .m4a, .asf, .wmv, .avi, .flv, .f4v, .f4a, .f4b, .f4p, .riff, .mkv, .mk3d, .mka, .mks, .mov, .qt, .mpeg, .mpg, .m2p, .ps, .mp4, .m4a, .m4p, .m4b, .m4r, .m4vi, .ogg, .ogv, .oga, .ogx, .spx, .opus, .rmvb, .rm, .dvd, .mts, .swf, .mp3, .m4v, .wav |
| Documents | .txt, .doc, .docx, .pdf, .ppt, .pptx, .xls, .xlsx, .zip, .rar, .xml, .apk |

During scanning, these files are not matched against the database, but their name is simply printed if any of the extensions match.
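A lookup of this kind might be sketched as follows. The example only covers the picture and document modes (the video list is left out for brevity), and it assumes that extensions are compared case-sensitively against the part of the name after the last dot. The class and method names are hypothetical.

```java
import java.util.Map;
import java.util.Set;

final class ScanModes {
    // Mode letter mapped to its hardcoded extension list (video mode omitted here)
    static final Map<Character, Set<String>> EXTENSIONS = Map.of(
        'p', Set.of(".tiff", ".tif", ".png", ".jpg", ".bmp", ".jpeg", ".cr2", ".gif"),
        'd', Set.of(".txt", ".doc", ".docx", ".pdf", ".ppt", ".pptx",
                    ".xls", ".xlsx", ".zip", ".rar", ".xml", ".apk"));

    /** Returns true if the file name ends in one of the extensions of the given mode. */
    static boolean matches(char mode, String fileName) {
        Set<String> extensions = EXTENSIONS.get(mode);
        if (extensions == null) return false; // unknown mode
        int dot = fileName.lastIndexOf('.');
        return dot >= 0 && extensions.contains(fileName.substring(dot));
    }
}
```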

Recovering Entries

Faced with a list of opaque MD5 hashes, we were wondering what content the app actually searched for. The smallest file that is referenced in the database has a size of 31 bytes (with the MD5 hash C0B8B4D706388E31C453B993015DF521), which is still way beyond what we can hope to brute-force in due time. We started compiling some interesting word lists, but these didn’t help either. Fortunately for us, all hashes in the database are regular MD5 hashes, i.e., the algorithm uses default initialization values. Hence, we can query external databases of known MD5 hashes, the most interesting being VirusTotal. Although its primary focus is on malicious files, the sheer amount of data uploaded to it makes it one of the more promising collections of files and their accompanying checksums.

Together with the team of investigative journalists we queried VirusTotal. In the end, we managed to identify more than 1,400 files. While this might sound like a lot of files, this still only accounts for roughly 1.9% of the database. The investigative team, together with colleagues from The Guardian and The New York Times, then analyzed and categorized the content we unveiled in more detail.

Quickly, it became apparent that much material is related to Islamist propaganda. This is no surprise, as we initially also discovered an asset aptly named terrorism_apps.csv. The file, however, was empty in the version of the app we analyzed. Still, their efforts also uncovered the presence of a document about the Dalai Lama as well as a file containing rock music of a Japanese band.

We’d like to refer the interested reader to the publications of the respective investigative teams which will discuss their findings in more detail.

Conclusion

Although the app is conceptually rather simple, it collects a plethora of personal data. While we were able to get a glimpse into which data the app collects and, moreover, which files it searches for, the majority still remains unknown at this point.

Acknowledgements

We would like to thank David Rupprecht for aiding with the hardware setup.

Revision History

| Date | Description |
| --- | --- |
| 2019-07-04 | Clarified IP handling, added links |
| 2019-07-03 | Updated affiliation, added links |
| 2019-07-02 | Initial version |