JavaRosa currently uses kxml to parse XML. Certain operations perform poorly O(n²), drastically slowing down parsing large documents. A successful proposal for this project must:
Objectively benchmark poorly performing operations to identify root causes and to be able to measure improvements
Evaluate potential changes for risk/reward trade-offs
Sequence changes to ensure high-value changes can be safely made (e.g., increased test coverage, comparing output, feature toggles)
The selected student will implement the proposed changes.
Hi @dcbriccetti, I am sandhya .I am interested in applying for GSoc 2019 for ** Improve parsing performance in ODK JavaRosa **project and I want to work with this community.I have good knowledge in java and worked with data transfering.can you provide some more requirements or information to work on this project and can you give some tasks to work on this project to get more information.
Thank you..
Hi @dcbriccetti.
I'm K. Harshitha interested in Improving parsing performance in ODK JavaRosa project GSOC 2019 and having some good knowledge in Java. How can I proceed further to contribute to this project?
Thank you!
Here’s a list of JavaRosa’s Java files that use kxml. A starting point might be to understand how JavaRosa uses kxml. Then to create a proposal as @yanokwa outlined in this post.
Hello @yanokwa@dcbriccetti i want to ask something. Though this might irrelevant to the project, i just wanted to speak what I have in mind
Since parsing in JR is done by creating the DOM document, why JR uses kxml? I don't see any API in kxml which utilize random access and XPath to select element. Instead parsing in JR uses for-loop then check the element name. Because of that, I want to offer changing the parsing library (kxml)
Currently, two popular method to parse XML are DOM-based and SAX/StAX/Pull-parse. But there are another method recently called VTD and the library name is VTD-XML. VTD create representation by saving token offset, length, depth level and not by making DOM element tree. VTD basically parsing + indexing xml library
Advantages
better parsing throughput compared to DOM and SAX
efficient memory usage compared to DOM approach because no overhead of creating many DOM objects instead only array-like memory overhead once
hierarchy tree build only for element, so there is no need to check for leaves (attribute, text node)
select only elements that satisfy path, no need for loop-if and handling of unexpected elements
XPath evaluation is also fastest among other DOM library
can save VTD into disk so that parsing is only done once, though this might not needed because JR uses FormDef which is similar to DOM structure
So what i think is the possibility to migrate parsing library. After that, study whether VTD indexing scheme is suitable for other JR objects