GSoC: Improve parsing performance in ODK JavaRosa

JavaRosa currently uses kxml to parse XML. Certain operations perform poorly O(n²), drastically slowing down parsing large documents. A successful proposal for this project must:

  • Objectively benchmark poorly performing operations to identify root causes and to be able to measure improvements
  • Evaluate potential changes for risk/reward trade-offs
  • Sequence changes to ensure high-value changes can be safely made (e.g., increased test coverage, comparing output, feature toggles)

The selected student will implement the proposed changes.

Requirements: Java
Repository: https://github.com/opendatakit/javarosa
Difficulty: Hard
Mentor: Dave Briccetti (@dcbriccetti)

2 Likes

Hi @dcbriccetti, I am sandhya .I am interested in applying for GSoc 2019 for ** Improve parsing performance in ODK JavaRosa **project and I want to work with this community.I have good knowledge in java and worked with data transfering.can you provide some more requirements or information to work on this project and can you give some tasks to work on this project to get more information.
Thank you..

@sandhyarishitha https://github.com/opendatakit/javarosa/blob/master/CONTRIBUTING.md is a good place to start. So is https://github.com/opendatakit/javarosa/issues?q=is%3Aissue+is%3Aopen+label%3A"good+first+issue".

Thank you @yanokwa for replying!

Hi @dcbriccetti.
I'm K. Harshitha interested in Improving parsing performance in ODK JavaRosa project GSOC 2019 and having some good knowledge in Java. How can I proceed further to contribute to this project?
Thank you!

Hi all! Welcome @sandhyarishitha and @harshitha_kamasani, and please excuse my delay in responding.

Here’s a list of JavaRosa’s Java files that use kxml. A starting point might be to understand how JavaRosa uses kxml. Then to create a proposal as @yanokwa outlined in this post.

dbl:~/devel/opendatakit/javarosa $ grep -rl --include '*.java' kxml .
./src/org/javarosa/core/model/actions/SetValueAction.java
./src/org/javarosa/core/model/instance/ExternalDataInstance.java
./src/org/javarosa/model/xform/CompactSerializingVisitor.java
./src/org/javarosa/model/xform/DataModelSerializer.java
./src/org/javarosa/model/xform/SMSSerializingVisitor.java
./src/org/javarosa/model/xform/XFormSerializingVisitor.java
./src/org/javarosa/xform/parse/ElementChildDeleter.java
./src/org/javarosa/xform/parse/FormInstanceParser.java
./src/org/javarosa/xform/parse/IElementHandler.java
./src/org/javarosa/xform/parse/IXFormParserFactory.java
./src/org/javarosa/xform/parse/RangeParser.java
./src/org/javarosa/xform/parse/StandardBindAttributesProcessor.java
./src/org/javarosa/xform/parse/SubmissionParser.java
./src/org/javarosa/xform/parse/XFormParseException.java
./src/org/javarosa/xform/parse/XFormParser.java
./src/org/javarosa/xform/parse/XFormParserFactory.java
./src/org/javarosa/xform/parse/XmlTextConsolidator.java
./src/org/javarosa/xform/util/InterningKXmlParser.java
./src/org/javarosa/xform/util/XFormAnswerDataSerializer.java
./src/org/javarosa/xform/util/XFormSerializer.java
./src/org/javarosa/xform/util/XFormUtils.java
./src/org/javarosa/xml/ElementParser.java
./src/org/javarosa/xml/TreeElementParser.java
./src/org/javarosa/xml/util/InvalidStructureException.java
./test/org/javarosa/xform/parse/ChildProcessingTest.java
./test/org/javarosa/xform/parse/SubmissionParserTest.java
./test/org/javarosa/xform/parse/XFormParserTest.java

Tqq @dcbriccetti I will work on it

Hello @yanokwa @dcbriccetti i want to ask something. Though this might irrelevant to the project, i just wanted to speak what I have in mind
Since parsing in JR is done by creating the DOM document, why JR uses kxml? I don't see any API in kxml which utilize random access and XPath to select element. Instead parsing in JR uses for-loop then check the element name. Because of that, I want to offer changing the parsing library (kxml)

Currently, two popular method to parse XML are DOM-based and SAX/StAX/Pull-parse. But there are another method recently called VTD and the library name is VTD-XML. VTD create representation by saving token offset, length, depth level and not by making DOM element tree. VTD basically parsing + indexing xml library

Advantages

  • better parsing throughput compared to DOM and SAX
  • efficient memory usage compared to DOM approach because no overhead of creating many DOM objects instead only array-like memory overhead once
  • hierarchy tree build only for element, so there is no need to check for leaves (attribute, text node)
  • select only elements that satisfy path, no need for loop-if and handling of unexpected elements
  • XPath evaluation is also fastest among other DOM library
  • can save VTD into disk so that parsing is only done once, though this might not needed because JR uses FormDef which is similar to DOM structure

So what i think is the possibility to migrate parsing library. After that, study whether VTD indexing scheme is suitable for other JR objects

Link to parsing throughput and memory usage benchmark
Link to XPath eval performance benchmark
Link on how VTD representations look like

2 Likes

Hi and welcome, @mdmahendri!

I'm sorry to say that VTD-XML would not be able to be leveraged because its license is GPLv2 (see https://sourceforge.net/projects/vtd-xml/) which is not compatible with ODK's Apache 2.0 license (see https://apache.org/licenses/GPL-compatibility.html).

I'll look at responding to some of your other questions and comments soon unless someone gets to it first!

2 Likes