We use ODK Collect and Aggregate to collect and submit data to Fusion Tables which is our main repository after collection. We are mostly a seasonal operation, with prime data collection occurring between roughly March to October of every year. We've been slowly becoming more reliant on it since 2011.
There have always been a few duplicate entries, but nothing major, and easy enough to remove. This year though, I've been seeing up to 100 to 200 duplicate entries in our bigger tables, around 1500 to 2000+ records. I've seen as many as 5 duplicates per record, although usually not more than 2 or 3.
I don't know what could have changed this year though. Most of the entries I've seen on the topic in the forum here suggest this was a problem mostly taken care of years ago.
What ODK tool and version are you using? And on what device and operating system version?
As far as I can tell, this has occurred while using Aggregate 1.6.x since March, and whichever Collect versions have been released during that time frame.
What steps can we take to reproduce the problem?
I am not sure. We just submit forms and get duplicates.
What you have you tried to fix the problem?
I'm not sure what to try. I'm hoping for suggestions. It's possible there is something external to ODK in our environment interfering with Fusion Table's publishing reply, but I have to start with ODK first.
I'm attaching a file with two GAE Application logs from Stackdriver Logging that I think recorded an instance of duplicate records being created today.
gae_app__logs__2018-11-02T01-00.json (62.9 KB)
@Greg_Pommen These sorts of problems are hard for us to track down because it's hard to reproduce. Most of the time, it's an issue on Google's end (of course, it's never our bug )
Can you give me a sense of size? How big are these forms? And how many records do you have? Does the data that ends up in Fusion Tables have IDs that you can use to dedup?
@yanokwa LOL, yes, always blame Google! In this case I can see it being Google, ODK might have something different in this year's versions, or perhaps it's our IT department's security software interfering somehow. But I have very little ability to contact the other two so I'll start with ODK!
Size of forms, does 157 to 556 lines of code help? We currently have 34 ODK forms active, with some having 0 records (no one seems to use those but then no duplicates either!), then ranging from 10 to 1320 records in 26 forms that are actually used. All of those have had duplicates. My apologies, we used to have over 2000 records in some tables but that was in previous years.
I use the meta-instance-id field to find the duplicates in the Fusion Tables. When there are many duplicates (10+), I export the CSV, eliminate the duplicates in that, then delete all the records in the affected Fusion Table, and import the cleaned file. When there are 100+ duplicates, I can't think of a better way to get rid of them.
You mentioned that your activity is seasonal. Do you start with a clean slate each season or do you reuse the same Aggregate database and/or Fusion Table instances?
Most are clean slate, but a few we have been using for years. We're getting the same duplication problem, regardless, Aggregate and Fusion Tables.
Unfortunately, I can't suggest anything at this moment about your problem
These kind of errors are tricky because there are many possible error points. Any error down the chain could trigger resending data even when data could be effectively being written in Fusion's end. We can't make Aggregate check whether data has already been sent to Fusion before sending it because that would consume quota much faster, and it could increase expenses too.
In any case, we've tried to stay clear from changes in core features of Aggregate since v1.4 because we're eventually replacing it with ODK Central. We have solved some web UI issues, we have improved security, and provided better VM/Docker alternatives, etc. but we haven't changed the publishers.
Looking at the logs you've attached, I can see some errors while Aggregate tries to acquire write locks on the database, but there are some recovery mechanisms in place for that and it doesn't look like the source of the issue you're experiencing.
Is there any other error on the logs?
We're trying to figure out what to do with this issue... Would you be so kind to write some lines explaining why you're using Fusion Tables among all the other alternatives to take data out from Aggregate? How do Fusion Tables fit in your data workflow?
Problems: ODK doesn't work well within our corporate environment since we need an external Google account to run the AppEngine, and therefore the Fusion Tables too. Google's support for Fusion Tables seems to be iffy, and it hasn't been updated in years. One reason I have to use the Fusion Tables API for custom websites is because Fusion Tables do not display on smartphones very well. We never had a problem with duplicates before this year. Now they are problem that I'm not looking forward to solving in the future.
@Greg_Pommen I've got some emails out to try to get some face time with the Fusion Tables team and I'll update this topic once I've made contact. Until that happens, I'd like to explore some alternatives...
- Do you think you'd be able to use Google Sheets instead of Fusion Tables? The limitations that I see are support for repeats and very large data sets, but we might be able to flatten data in Aggregate and Google Sheets now supports 5 million cells.
- Would something like ODK Central's OData support help? I'm not proposing you switch. I'm just curious that functionality addresses your use-case.
@yanokwa I'm assuming you've seen the very recent notice that Fusion Tables are being discontinued. The biggest advantage of Fusion Tables is their ability to do simple real time mapping and so inexpensively. I don't know how any of the alternatives can address this yet. We are going to have to do a complete rethink of how we do things here. Unfortunately I'm on seasonal layoff at the moment so I won't be able to investigate them for awhile. Just looking very quickly at the OData support, I notice it can link to Tableau, which we do have. Unfortunately, due to licensing cost issues, last I checked, we can't do real time mapping with that either! Anyway, it doesn't sound like it's worth wasting too much of your time on Fusion Table issues anymore. It's too bad, Fusion Tables were a rather useful and unique offering. I was hoping Google would have kept developing them. Once again though, thank you all very, very much for your help!
@Greg_Pommen I did see the announcement and also put it at Google Fusion Tables to be shut down on December 3, 2019.
The short of it is that the deprecation of Fusion Tables isn't a big surprise. I've had a conversation with some folks at Google and everyone I've spoken with is committed to help those of you who use it with Aggregate have a soft landing.
The most important thing we need at this moment is a clear understanding of what Fusion Tables offered (it sounds like mapping) so we can help come up with reasonable alternatives. That said, I think free mapping is going the way of the dodo, so keep that in mind.
@yanokwa I had a feeling something was up with Fusion Tables. However, we've had a good run with them. It helped us get out of the dark ages of using paper for everything, i.e. collect useless piles of paper. It should be easier (somewhat) to get approval for budgets for digital data collection and mapping now that we've proved the worth of such things. Depending on how everything shakes out, you may be getting even more questions from me soon! Thanks again!