GSoC: Improve parsing performance in ODK JavaRosa

yanokwa · February 27, 2019, 5:47pm

JavaRosa currently uses kxml to parse XML. Certain operations perform poorly O(n²), drastically slowing down parsing large documents. A successful proposal for this project must:

Objectively benchmark poorly performing operations to identify root causes and to be able to measure improvements
Evaluate potential changes for risk/reward trade-offs
Sequence changes to ensure high-value changes can be safely made (e.g., increased test coverage, comparing output, feature toggles)

The selected student will implement the proposed changes.

Requirements: Java
Repository: https://github.com/opendatakit/javarosa
Difficulty: Hard
Mentor: Dave Briccetti (@dcbriccetti)

sandhyarishitha · March 11, 2019, 4:57am

Hi @dcbriccetti, I am sandhya .I am interested in applying for GSoc 2019 for ** Improve parsing performance in ODK JavaRosa **project and I want to work with this community.I have good knowledge in java and worked with data transfering.can you provide some more requirements or information to work on this project and can you give some tasks to work on this project to get more information.
Thank you..

yanokwa · March 12, 2019, 5:07am

@sandhyarishitha https://github.com/opendatakit/javarosa/blob/master/CONTRIBUTING.md is a good place to start. So is https://github.com/opendatakit/javarosa/issues?q=is%3Aissue+is%3Aopen+label%3A"good+first+issue".

sandhyarishitha · March 13, 2019, 6:56am

Thank you @yanokwa for replying!

harshitha_kamasani · March 19, 2019, 5:04am

Hi @dcbriccetti.
I'm K. Harshitha interested in Improving parsing performance in ODK JavaRosa project GSOC 2019 and having some good knowledge in Java. How can I proceed further to contribute to this project?
Thank you!

dcbriccetti · March 21, 2019, 5:57pm

Hi all! Welcome @sandhyarishitha and @harshitha_kamasani, and please excuse my delay in responding.

Here’s a list of JavaRosa’s Java files that use kxml. A starting point might be to understand how JavaRosa uses kxml. Then to create a proposal as @yanokwa outlined in this post.

dbl:~/devel/opendatakit/javarosa $ grep -rl --include '*.java' kxml .
./src/org/javarosa/core/model/actions/SetValueAction.java
./src/org/javarosa/core/model/instance/ExternalDataInstance.java
./src/org/javarosa/model/xform/CompactSerializingVisitor.java
./src/org/javarosa/model/xform/DataModelSerializer.java
./src/org/javarosa/model/xform/SMSSerializingVisitor.java
./src/org/javarosa/model/xform/XFormSerializingVisitor.java
./src/org/javarosa/xform/parse/ElementChildDeleter.java
./src/org/javarosa/xform/parse/FormInstanceParser.java
./src/org/javarosa/xform/parse/IElementHandler.java
./src/org/javarosa/xform/parse/IXFormParserFactory.java
./src/org/javarosa/xform/parse/RangeParser.java
./src/org/javarosa/xform/parse/StandardBindAttributesProcessor.java
./src/org/javarosa/xform/parse/SubmissionParser.java
./src/org/javarosa/xform/parse/XFormParseException.java
./src/org/javarosa/xform/parse/XFormParser.java
./src/org/javarosa/xform/parse/XFormParserFactory.java
./src/org/javarosa/xform/parse/XmlTextConsolidator.java
./src/org/javarosa/xform/util/InterningKXmlParser.java
./src/org/javarosa/xform/util/XFormAnswerDataSerializer.java
./src/org/javarosa/xform/util/XFormSerializer.java
./src/org/javarosa/xform/util/XFormUtils.java
./src/org/javarosa/xml/ElementParser.java
./src/org/javarosa/xml/TreeElementParser.java
./src/org/javarosa/xml/util/InvalidStructureException.java
./test/org/javarosa/xform/parse/ChildProcessingTest.java
./test/org/javarosa/xform/parse/SubmissionParserTest.java
./test/org/javarosa/xform/parse/XFormParserTest.java

sandhyarishitha · March 22, 2019, 3:27am

Tqq @dcbriccetti I will work on it

mdmahendri · March 30, 2019, 6:18am

Hello @yanokwa @dcbriccetti i want to ask something. Though this might irrelevant to the project, i just wanted to speak what I have in mind
Since parsing in JR is done by creating the DOM document, why JR uses kxml? I don't see any API in kxml which utilize random access and XPath to select element. Instead parsing in JR uses for-loop then check the element name. Because of that, I want to offer changing the parsing library (kxml)

Currently, two popular method to parse XML are DOM-based and SAX/StAX/Pull-parse. But there are another method recently called VTD and the library name is VTD-XML. VTD create representation by saving token offset, length, depth level and not by making DOM element tree. VTD basically parsing + indexing xml library

Advantages

better parsing throughput compared to DOM and SAX
efficient memory usage compared to DOM approach because no overhead of creating many DOM objects instead only array-like memory overhead once
hierarchy tree build only for element, so there is no need to check for leaves (attribute, text node)
select only elements that satisfy path, no need for loop-if and handling of unexpected elements
XPath evaluation is also fastest among other DOM library
can save VTD into disk so that parsing is only done once, though this might not needed because JR uses FormDef which is similar to DOM structure

So what i think is the possibility to migrate parsing library. After that, study whether VTD indexing scheme is suitable for other JR objects

Link to parsing throughput and memory usage benchmark
Link to XPath eval performance benchmark
Link on how VTD representations look like

LN · April 1, 2019, 7:24pm

Hi and welcome, @mdmahendri!

I'm sorry to say that VTD-XML would not be able to be leveraged because its license is GPLv2 (see https://sourceforge.net/projects/vtd-xml/) which is not compatible with ODK's Apache 2.0 license (see https://apache.org/licenses/GPL-compatibility.html).

I'll look at responding to some of your other questions and comments soon unless someone gets to it first!