We managed to solve this problem by using pulldata() instead of instance().
Background:
When a form definition contains many instance() lookups and the associated secondary instance (entity list) contains thousands of rows, Enketo needs to iterate through the list many times, which takes time. If Enketo is able to complete all the lookups within 30 seconds, the user will see the media/images on the form; otherwise they won't. The time Enketo takes depends on the number of rows in the secondary instance, the number of instance() calls in the form definition, and the user's machine, OS and browser.
To work around this problem, the form definition can be updated to use pulldata() instead of instance(). In Enketo, pulldata() uses the browser's default XPath evaluator, which is generally faster. For more details, see Difference or use of 'pulldata' and 'instance' - #2 by LN
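
As a rough illustration of the change, a single lookup might be rewritten along these lines. The list name `trees`, the property `species` and the field `${tree_id}` are hypothetical placeholders, not taken from our form, so adjust them to your own entity list. This assumes the usual pulldata() argument order of list name, column to return, column to match on, and value to match.

```
instance() version:
instance('trees')/root/item[name = ${tree_id}]/species

pulldata() version:
pulldata('trees', 'species', 'name', ${tree_id})
```

Both expressions are meant to return the same value for a given `${tree_id}`; only the way Enketo evaluates the lookup should differ.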